Processing Big Data

In this project, an AirBnb dataset for the Netherlands is taken from Kaggle. All three types of algorithms are applied to this dataset: regression, classification and clustering.

Models that are implemented are following:

First, the required packages are installed.

Installation of Packages

install.packages("e1071")
install.packages("caTools")
install.packages("corrplot")
install.packages("devtools")
install.packages("dendextend")
install.packages("tree")
install.packages("zoo")
install.packages("scales")
install.packages("ggmap")
install.packages("stringr")
install.packages("gridExtra")
install.packages("caret")
install.packages("treemap")
install.packages("psych")
install.packages("DAAG")
install.packages("leaps")
install.packages("glmnet")
install.packages("boot")
install.packages("naniar")
install.packages("tidyr")
install.packages("DT")
install.packages("ggplot2")
install.packages("dplyr")
install.packages("tidyverse")
install.packages("kableExtra")
install.packages("lubridate")
install.packages("readxl")
install.packages("highcharter")
install.packages("RColorBrewer")
install.packages("wesanderson")
install.packages("plotly")
install.packages("shiny")
install.packages("readr")
install.packages("choroplethr")
install.packages("choroplethrMaps")
install.packages("GGally")
install.packages("ade4")
install.packages("data.table")

After installing the packages, the next step is to load them so that they can be used throughout the analysis.

Loading Packages
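
The loading code itself is not shown here. A minimal sketch of this step, assuming that only the packages whose functions are called later in the analysis need to be attached (dplyr, ggplot2, naniar, plotly, corrplot and leaps); the names `pkgs` and `loaded` are hypothetical:

```r
# Packages whose functions are called later in this analysis
# (an assumed subset of the installed list above)
pkgs <- c("dplyr", "ggplot2", "naniar", "plotly", "corrplot", "leaps")

# Attach each package; require() returns FALSE instead of stopping
# when a package is not installed
loaded <- vapply(
  pkgs,
  function(p) suppressWarnings(suppressPackageStartupMessages(
    require(p, character.only = TRUE, quietly = TRUE)
  )),
  logical(1)
)

# Names of packages that still need install.packages()
names(loaded)[!loaded]
```

Using require() rather than library() here is a design choice: it lets the script report every missing package at once instead of stopping at the first one.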

Import Dataset in CSV format

The dataset is imported from a CSV file and stored in a variable named AirBnb. The first six observations are displayed to give a brief look at all attributes and some records.

AirBnb = read.csv("AirBNB.csv")
head(AirBnb)
##   host_id host_name host_since_year host_since_anniversary      id
## 1    1662     Chloe            2008                 08-Nov  304958
## 2    3159    Daniel            2008                 Sep-24    2818
## 3    3718    Britta            2008                 Oct-19  103026
## 4    4716    Stefan            2008                 Nov-30  550017
## 5    5271     Tyler            2008                 Dec-17 4728389
## 6    5271     Tyler            2008                 Dec-17 5500954
##                   neighbourhood_cleansed      city         state zipcode
## 1                             Westerpark Amsterdam North Holland    1053
## 2 Oostelijk Havengebied - Indische Buurt Amsterdam North Holland        
## 3                 De Baarsjes - Oud-West Amsterdam Noord-Holland    1053
## 4                           Centrum-Oost Amsterdam North Holland    1017
## 5                           Centrum-West Amsterdam Noord-Holland 1016 AM
## 6                           Centrum-West Amsterdam            NH 1016 AM
##       country property_type       room_type accommodates bathrooms bedrooms
## 1 Netherlands     Apartment Entire home/apt            4         2        2
## 2 Netherlands     Apartment    Private room            2         1        1
## 3 Netherlands     Apartment Entire home/apt            4         1        1
## 4 Netherlands     Apartment Entire home/apt            2         1        1
## 5 Netherlands     Apartment Entire home/apt            6         1        2
## 6 Netherlands     Apartment    Private room            4         1        1
##   beds bed_type price guests_included extra_people minimum_nights
## 1    2 Real Bed   130               4           10              4
## 2    2 Real Bed    59               1           10              3
## 3    1 Real Bed    95               2           25              3
## 4    1 Real Bed   100               1           10              2
## 5    2 Real Bed   250               2           25              2
## 6    1 Real Bed   140               2           25              2
##   host_response_time host_response_rate number_of_reviews review_scores_rating
## 1       within a day                0.8                11                   98
## 2     within an hour                  1               108                   97
## 3 within a few hours                  1                15                   92
## 4       within a day                  1                20                   97
## 5       within a day               0.89                 1                  100
## 6       within a day                0.9                 0                   NA
##   review_scores_accuracy review_scores_cleanliness review_scores_checkin
## 1                     10                        10                     9
## 2                     10                        10                    10
## 3                      9                         9                    10
## 4                     10                        10                    10
## 5                      8                        10                     8
## 6                     NA                        NA                    NA
##   review_scores_communication review_scores_location review_scores_value
## 1                          10                     10                  10
## 2                          10                      9                  10
## 3                          10                      9                   9
## 4                          10                     10                  10
## 5                          10                     10                   6
## 6                          NA                     NA                  NA

Checking Dimensionality

dim(AirBnb)
## [1] 7833   31

Our dataset has 31 features and 7833 observations.

Column Names

The names of all features are checked, since they should be easy to understand.

colnames(AirBnb)
##  [1] "host_id"                     "host_name"                  
##  [3] "host_since_year"             "host_since_anniversary"     
##  [5] "id"                          "neighbourhood_cleansed"     
##  [7] "city"                        "state"                      
##  [9] "zipcode"                     "country"                    
## [11] "property_type"               "room_type"                  
## [13] "accommodates"                "bathrooms"                  
## [15] "bedrooms"                    "beds"                       
## [17] "bed_type"                    "price"                      
## [19] "guests_included"             "extra_people"               
## [21] "minimum_nights"              "host_response_time"         
## [23] "host_response_rate"          "number_of_reviews"          
## [25] "review_scores_rating"        "review_scores_accuracy"     
## [27] "review_scores_cleanliness"   "review_scores_checkin"      
## [29] "review_scores_communication" "review_scores_location"     
## [31] "review_scores_value"

In our case all the names are good enough to read and understand.

Changing Character Data Types to Categorical Variables

As the column names suggest, some features hold categorical data. When the data is loaded, these are read as character data. To work with such variables, their data type needs to be converted to factor.

AirBnb <- as.data.frame(unclass(AirBnb), stringsAsFactors = TRUE)
str(AirBnb)
## 'data.frame':    7833 obs. of  31 variables:
##  $ host_id                    : int  1662 3159 3718 4716 5271 5271 5271 5988 9616 14589 ...
##  $ host_name                  : Factor w/ 2987 levels "(email hidden)",..: 439 522 348 2644 2806 2806 2806 2343 1576 2486 ...
##  $ host_since_year            : int  2008 2008 2008 2008 2008 2008 2008 2009 2009 2009 ...
##  $ host_since_anniversary     : Factor w/ 366 levels "01-Apr","01-Aug",..: 94 360 336 329 186 186 186 1 36 155 ...
##  $ id                         : int  304958 2818 103026 550017 4728389 5500954 5181918 2774924 23651 738245 ...
##  $ neighbourhood_cleansed     : Factor w/ 22 levels "Bijlmer-Centrum",..: 21 15 8 5 6 6 6 22 9 6 ...
##  $ city                       : Factor w/ 35 levels "Ã\201msterdam",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ state                      : Factor w/ 23 levels "","Amsterdam",..: 19 19 15 19 15 12 15 19 19 19 ...
##  $ zipcode                    : Factor w/ 3276 levels ""," ","....",..: 1429 1 1429 683 551 551 551 2243 2641 382 ...
##  $ country                    : Factor w/ 1 level "Netherlands": 1 1 1 1 1 1 1 1 1 1 ...
##  $ property_type              : Factor w/ 15 levels "Apartment","Bed & Breakfast",..: 1 1 1 1 1 1 1 9 1 9 ...
##  $ room_type                  : Factor w/ 3 levels "Entire home/apt",..: 1 2 1 1 1 2 2 2 2 1 ...
##  $ accommodates               : int  4 2 4 2 6 4 2 2 3 2 ...
##  $ bathrooms                  : num  2 1 1 1 1 1 1 1 1 1 ...
##  $ bedrooms                   : int  2 1 1 1 2 1 1 1 1 1 ...
##  $ beds                       : int  2 2 1 1 2 1 1 1 1 1 ...
##  $ bed_type                   : Factor w/ 5 levels "Airbed","Couch",..: 5 5 5 5 5 5 3 5 5 5 ...
##  $ price                      : int  130 59 95 100 250 140 115 80 80 90 ...
##  $ guests_included            : int  4 1 2 1 2 2 1 1 2 1 ...
##  $ extra_people               : int  10 10 25 10 25 25 0 0 15 0 ...
##  $ minimum_nights             : int  4 3 3 2 2 2 1 3 6 3 ...
##  $ host_response_time         : Factor w/ 5 levels "a few days or more",..: 3 5 4 3 3 3 3 5 3 2 ...
##  $ host_response_rate         : Factor w/ 86 levels "0.02","0.05",..: 65 85 85 85 74 75 74 85 85 86 ...
##  $ number_of_reviews          : int  11 108 15 20 1 0 4 33 36 8 ...
##  $ review_scores_rating       : int  98 97 92 97 100 NA 95 95 96 93 ...
##  $ review_scores_accuracy     : int  10 10 9 10 8 NA 9 9 9 10 ...
##  $ review_scores_cleanliness  : int  10 10 9 10 10 NA 9 10 10 9 ...
##  $ review_scores_checkin      : int  9 10 10 10 8 NA 9 10 10 9 ...
##  $ review_scores_communication: int  10 10 10 10 10 NA 10 10 10 9 ...
##  $ review_scores_location     : int  10 9 9 10 10 NA 10 10 9 10 ...
##  $ review_scores_value        : int  10 10 9 10 6 NA 9 9 9 9 ...
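
As an illustration of the same conversion, here is a minimal base R sketch on a toy data frame. The data and the names `toy` and `is_chr` are hypothetical; only character columns are converted, and other column types are left untouched:

```r
# Toy data frame (hypothetical values) with one character and one numeric column
toy <- data.frame(
  room_type = c("Private room", "Entire home/apt", "Private room"),
  price     = c(59, 130, 80),
  stringsAsFactors = FALSE
)

# Find the character columns and convert only those to factors
is_chr <- vapply(toy, is.character, logical(1))
toy[is_chr] <- lapply(toy[is_chr], factor)

str(toy)
```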

Checking the Null Values in the Dataset

Now we check the number of null values and which variables contain them.

sum(is.na(AirBnb))
## [1] 12051
summary(is.na(AirBnb))
##   host_id        host_name       host_since_year host_since_anniversary
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical         
##  FALSE:7833      FALSE:7833      FALSE:7833      FALSE:7833            
##                                                                        
##      id          neighbourhood_cleansed    city           state        
##  Mode :logical   Mode :logical          Mode :logical   Mode :logical  
##  FALSE:7833      FALSE:7833             FALSE:7833      FALSE:7833     
##                                                                        
##   zipcode         country        property_type   room_type      
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:7833      FALSE:7833      FALSE:7833      FALSE:7833     
##                                                                 
##  accommodates    bathrooms        bedrooms          beds        
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:7833      FALSE:7764      FALSE:7819      FALSE:7820     
##                  TRUE :69        TRUE :14        TRUE :13       
##   bed_type         price         guests_included extra_people   
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:7833      FALSE:7833      FALSE:7833      FALSE:7833     
##                                                                 
##  minimum_nights  host_response_time host_response_rate number_of_reviews
##  Mode :logical   Mode :logical      Mode :logical      Mode :logical    
##  FALSE:7833      FALSE:7833         FALSE:7833         FALSE:7833       
##                                                                         
##  review_scores_rating review_scores_accuracy review_scores_cleanliness
##  Mode :logical        Mode :logical          Mode :logical            
##  FALSE:6135           FALSE:6124             FALSE:6124               
##  TRUE :1698           TRUE :1709             TRUE :1709               
##  review_scores_checkin review_scores_communication review_scores_location
##  Mode :logical         Mode :logical               Mode :logical         
##  FALSE:6125            FALSE:6122                  FALSE:6124            
##  TRUE :1708            TRUE :1711                  TRUE :1709            
##  review_scores_value
##  Mode :logical      
##  FALSE:6122         
##  TRUE :1711

The total number of NA’s is 12051. The summary shows that the NA’s occur in bathrooms, bedrooms, beds and the seven review_scores_* columns.
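
A more compact way to get the per-column counts is colSums() on the logical matrix returned by is.na(). A minimal sketch on a toy data frame (hypothetical values, not the AirBnb data):

```r
# Toy data frame (hypothetical values) with a few missing entries
toy <- data.frame(
  beds   = c(1, NA, 2, 1),
  rating = c(98, NA, NA, 95),
  price  = c(130, 59, 95, 100)
)

# Count the NA's in every column
na_counts <- colSums(is.na(toy))

# Keep only the columns that actually contain NA's
na_counts[na_counts > 0]
```

Applied to the real data, the same call would be `colSums(is.na(AirBnb))`.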

Missing Values using Graphical Representation

Here the null values in each attribute are visualized graphically.

gg_miss_var(AirBnb)

Heat Plot of Missing Values

The heat plot clearly shows which features contain null values, along with the overall percentage of missing and present values.

vis_miss(AirBnb) + theme(axis.text.x = element_text(angle = 90))

Handling NA’s and Null Values

The null values in the dataset now need to be handled. The dataset contains two kinds of variables: numeric and non-numeric. For numeric variables, each NA in a feature is replaced by the mean of the observed values of that feature. The non-numeric variables contain no NA’s, so they are left as they are.

Imputation of Numeric Variables

AirBnb <- AirBnb %>% 
  mutate(review_scores_rating = ifelse(is.na(review_scores_rating), mean(review_scores_rating,na.rm=TRUE),review_scores_rating),
         bedrooms = ifelse(is.na(bedrooms), mean(bedrooms,na.rm=TRUE),bedrooms), beds = ifelse(is.na(beds), mean(beds,na.rm=TRUE),beds),
         bathrooms = ifelse(is.na(bathrooms), mean(bathrooms,na.rm=TRUE),bathrooms))

AirBnb <- AirBnb %>% 
  mutate(review_scores_accuracy = ifelse(is.na(review_scores_accuracy), mean(review_scores_accuracy,na.rm=TRUE),review_scores_accuracy),
         review_scores_cleanliness = ifelse(is.na(review_scores_cleanliness), mean(review_scores_cleanliness,na.rm=TRUE),review_scores_cleanliness), 
         review_scores_checkin = ifelse(is.na(review_scores_checkin), mean(review_scores_checkin,na.rm=TRUE),review_scores_checkin),
         review_scores_communication = ifelse(is.na(review_scores_communication), mean(review_scores_communication,na.rm=TRUE),review_scores_communication),
         review_scores_location = ifelse(is.na(review_scores_location), mean(review_scores_location,na.rm=TRUE),review_scores_location),
         review_scores_value = ifelse(is.na(review_scores_value), mean(review_scores_value,na.rm=TRUE),review_scores_value))
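
The repeated ifelse() pattern above can be sketched more compactly with a small helper. This is a base R illustration on toy data; the function name `impute_mean` and the values are hypothetical:

```r
# Replace the NA's of a numeric vector with the mean of its observed values
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

# Toy data frame (hypothetical values)
toy <- data.frame(beds = c(1, NA, 3), bathrooms = c(1, 1, NA))

# Apply the helper to every column, keeping the data frame structure
toy[] <- lapply(toy, impute_mean)
toy
```

Note that mean imputation turns integer columns into doubles, which is why bedrooms and beds appear as `<dbl>` after this step.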


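# Note: both columns are factors here. ifelse() drops the factor attributes,
# so afterwards they hold integer level codes, not the original labels.
AirBnb <- AirBnb %>%
  mutate(host_response_rate = ifelse(host_response_rate== "NA", mean(host_response_rate), host_response_rate),
         host_response_time = ifelse(host_response_time== "NA", NA, host_response_time))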
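
Because host_response_rate is stored as a factor, arithmetic on it needs care: as.integer() on a factor yields level codes, not the underlying numbers. A minimal sketch of converting such a factor to real numeric values before imputing, on toy data. The placeholder label "N/A" is an assumption, since the label that marks missing rates is not shown here:

```r
# Toy factor of response rates (hypothetical values); "N/A" is an assumed
# placeholder for a missing rate
rate <- factor(c("0.8", "1", "N/A", "0.89"))

# Level codes: positions in levels(rate), not the rates themselves
as.integer(rate)

# Go through character to recover the numbers; non-numeric labels become NA
rate_num <- suppressWarnings(as.numeric(as.character(rate)))

# Mean-impute the entries that could not be parsed
rate_num[is.na(rate_num)] <- mean(rate_num, na.rm = TRUE)
rate_num
```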

Visualization of Dataset

Here we visualize the dataset again after handling the null values.

gg_miss_var(AirBnb)

vis_miss(AirBnb) + theme(axis.text.x = element_text(angle = 90))

This confirms that no missing values remain in the dataset.

Summary of the loaded Dataset

summary(AirBnb)
##     host_id               host_name    host_since_year host_since_anniversary
##  Min.   :    1662   Douwe&Niki :  91   Min.   :2008    Jun-19 : 118          
##  1st Qu.: 3430410   Jorrit&Dirk:  72   1st Qu.:2012    02-May :  95          
##  Median : 7392601   Myra       :  59   Median :2013    Aug-21 :  90          
##  Mean   : 9879849   Peter      :  50   Mean   :2013    12-Feb :  51          
##  3rd Qu.:15054166   Michiel    :  49   3rd Qu.:2014    10-Sep :  49          
##  Max.   :30595041   Anne       :  43   Max.   :2015    Aug-31 :  45          
##                     (Other)    :7469                   (Other):7385          
##        id                      neighbourhood_cleansed                 city     
##  Min.   :   2818   Centrum-West           :1426       Amsterdam         :7702  
##  1st Qu.:1309364   De Baarsjes - Oud-West :1203       Amsterdam-Zuidoost:  35  
##  Median :2964891   Centrum-Oost           : 920       Diemen            :  14  
##  Mean   :2926936   De Pijp - Rivierenbuurt: 906       Jordaan           :  13  
##  3rd Qu.:4473450   Westerpark             : 689       Watergraafsmeer   :   9  
##  Max.   :5897527   Zuid                   : 579       Ã\201msterdam        :   7  
##                    (Other)                :2110       (Other)           :  53  
##            state         zipcode            country             property_type 
##  North Holland:5761   1054   : 209   Netherlands:7833   Apartment      :6280  
##  Noord-Holland:1877   1015   : 181                      House          : 711  
##  NH           : 159   1017   : 176                      Bed & Breakfast: 370  
##               :   8          : 173                      Boat           : 327  
##  Noord Holland:   5   1053   : 155                      Loft           :  77  
##  Amsterdam    :   3   1013   : 149                      Other          :  29  
##  (Other)      :  20   (Other):6790                      (Other)        :  39  
##            room_type     accommodates      bathrooms        bedrooms     
##  Entire home/apt:6305   Min.   : 1.000   Min.   :0.000   Min.   : 0.000  
##  Private room   :1482   1st Qu.: 2.000   1st Qu.:1.000   1st Qu.: 1.000  
##  Shared room    :  46   Median : 2.000   Median :1.000   Median : 1.000  
##                         Mean   : 3.115   Mean   :1.113   Mean   : 1.415  
##                         3rd Qu.: 4.000   3rd Qu.:1.000   3rd Qu.: 2.000  
##                         Max.   :16.000   Max.   :8.000   Max.   :10.000  
##                                                                          
##       beds                 bed_type        price      guests_included 
##  Min.   : 1.000   Airbed       :  13   Min.   :  15   Min.   : 0.000  
##  1st Qu.: 1.000   Couch        :  11   1st Qu.:  85   1st Qu.: 1.000  
##  Median : 1.000   Futon        :  26   Median : 109   Median : 1.000  
##  Mean   : 1.984   Pull-out Sofa:  94   Mean   : 129   Mean   : 1.642  
##  3rd Qu.: 2.000   Real Bed     :7689   3rd Qu.: 150   3rd Qu.: 2.000  
##  Max.   :16.000                        Max.   :9000   Max.   :16.000  
##                                                                       
##   extra_people    minimum_nights   host_response_time host_response_rate
##  Min.   :  0.00   Min.   : 1.000   Min.   :1.000      Min.   : 1.00     
##  1st Qu.:  0.00   1st Qu.: 1.000   1st Qu.:3.000      1st Qu.:75.00     
##  Median :  0.00   Median : 2.000   Median :4.000      Median :85.00     
##  Mean   : 13.62   Mean   : 2.509   Mean   :3.756      Mean   :76.83     
##  3rd Qu.: 25.00   3rd Qu.: 3.000   3rd Qu.:5.000      3rd Qu.:85.00     
##  Max.   :235.00   Max.   :27.000   Max.   :5.000      Max.   :86.00     
##                                                                         
##  number_of_reviews review_scores_rating review_scores_accuracy
##  Min.   :  0.00    Min.   : 20.00       Min.   : 2.000        
##  1st Qu.:  1.00    1st Qu.: 92.00       1st Qu.: 9.000        
##  Median :  5.00    Median : 93.34       Median : 9.447        
##  Mean   : 13.83    Mean   : 93.34       Mean   : 9.447        
##  3rd Qu.: 15.00    3rd Qu.: 98.00       3rd Qu.:10.000        
##  Max.   :297.00    Max.   :100.00       Max.   :10.000        
##                                                               
##  review_scores_cleanliness review_scores_checkin review_scores_communication
##  Min.   : 2.00             Min.   : 2.000        Min.   : 2.000             
##  1st Qu.: 9.00             1st Qu.: 9.639        1st Qu.: 9.698             
##  Median : 9.29             Median :10.000        Median :10.000             
##  Mean   : 9.29             Mean   : 9.639        Mean   : 9.698             
##  3rd Qu.:10.00             3rd Qu.:10.000        3rd Qu.:10.000             
##  Max.   :10.00             Max.   :10.000        Max.   :10.000             
##                                                                             
##  review_scores_location review_scores_value
##  Min.   : 2.000         Min.   : 2.00      
##  1st Qu.: 9.000         1st Qu.: 9.00      
##  Median : 9.293         Median : 9.00      
##  Mean   : 9.293         Mean   : 9.04      
##  3rd Qu.:10.000         3rd Qu.: 9.04      
##  Max.   :10.000         Max.   :10.00      
## 

Visualizing the Data

Glimpse shows the number of dimensions, the number of observations, the data types and all the column names.

glimpse(AirBnb)
## Rows: 7,833
## Columns: 31
## $ host_id                     <int> 1662, 3159, 3718, 4716, 5271, 5271, 5271, ~
## $ host_name                   <fct> "Chloe", "Daniel", "Britta", "Stefan", "Ty~
## $ host_since_year             <int> 2008, 2008, 2008, 2008, 2008, 2008, 2008, ~
## $ host_since_anniversary      <fct> 08-Nov, Sep-24, Oct-19, Nov-30, Dec-17, De~
## $ id                          <int> 304958, 2818, 103026, 550017, 4728389, 550~
## $ neighbourhood_cleansed      <fct> Westerpark, Oostelijk Havengebied - Indisc~
## $ city                        <fct> "Amsterdam", "Amsterdam", "Amsterdam", "Am~
## $ state                       <fct> North Holland, North Holland, Noord-Hollan~
## $ zipcode                     <fct> 1053, , 1053, 1017, 1016 AM, 1016 AM, 1016~
## $ country                     <fct> Netherlands, Netherlands, Netherlands, Net~
## $ property_type               <fct> Apartment, Apartment, Apartment, Apartment~
## $ room_type                   <fct> Entire home/apt, Private room, Entire home~
## $ accommodates                <int> 4, 2, 4, 2, 6, 4, 2, 2, 3, 2, 3, 3, 2, 3, ~
## $ bathrooms                   <dbl> 2.000000, 1.000000, 1.000000, 1.000000, 1.~
## $ bedrooms                    <dbl> 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 2, 1, 1, ~
## $ beds                        <dbl> 2, 2, 1, 1, 2, 1, 1, 1, 1, 1, 2, 2, 1, 1, ~
## $ bed_type                    <fct> Real Bed, Real Bed, Real Bed, Real Bed, Re~
## $ price                       <int> 130, 59, 95, 100, 250, 140, 115, 80, 80, 9~
## $ guests_included             <int> 4, 1, 2, 1, 2, 2, 1, 1, 2, 1, 2, 1, 1, 2, ~
## $ extra_people                <int> 10, 10, 25, 10, 25, 25, 0, 0, 15, 0, 30, 0~
## $ minimum_nights              <int> 4, 3, 3, 2, 2, 2, 1, 3, 6, 3, 7, 3, 3, 4, ~
## $ host_response_time          <int> 3, 5, 4, 3, 3, 3, 3, 5, 3, 2, 4, 3, 5, 3, ~
## $ host_response_rate          <int> 65, 85, 85, 85, 74, 75, 74, 85, 85, 86, 85~
## $ number_of_reviews           <int> 11, 108, 15, 20, 1, 0, 4, 33, 36, 8, 3, 2,~
## $ review_scores_rating        <dbl> 98.0000, 97.0000, 92.0000, 97.0000, 100.00~
## $ review_scores_accuracy      <dbl> 10.00000, 10.00000, 9.00000, 10.00000, 8.0~
## $ review_scores_cleanliness   <dbl> 10.000000, 10.000000, 9.000000, 10.000000,~
## $ review_scores_checkin       <dbl> 9.000000, 10.000000, 10.000000, 10.000000,~
## $ review_scores_communication <dbl> 10.000000, 10.000000, 10.000000, 10.000000~
## $ review_scores_location      <dbl> 10.000000, 9.000000, 9.000000, 10.000000, ~
## $ review_scores_value         <dbl> 10.000000, 10.000000, 9.000000, 10.000000,~

Exploratory Data Analysis

Analysis of property_type

This pie chart shows the property types of the listings in the dataset along with their percentages.

property_type_d <- data.frame(table(AirBnb$property_type))
property_type_data <- property_type_d[,c('Var1', 'Freq')]
fig <- plot_ly(property_type_data, labels = ~Var1, values = ~Freq, type = 'pie')
fig
[Pie chart: share of listings by property type. Apartment 80.2%, House 9.08%, Bed & Breakfast 4.72%, Boat 4.17%, Loft 0.983%, Other 0.37%, Cabin 0.153%, Camper/RV 0.14%, Villa 0.102%, Dorm 0.0255%, Yurt 0.0255%, Chalet 0.0128%, Earth House 0.0128%, Hut 0.0128%, Treehouse 0.0128%]

Type of Listings present in each Neighbourhood Group

# Group neighbourhood_cleansed variable with room_type.
property_df <-  AirBnb %>% 
  group_by(neighbourhood_cleansed, room_type) %>% 
  summarize(Freq = n())

# Filtering room_type and grouping it with particular neighbourhood_cleansed
total_property <-  AirBnb %>% 
    filter(room_type %in% c("Private room","Entire home/apt","Shared room")) %>% 
    group_by(neighbourhood_cleansed) %>% 
    summarize(sum = n())

# Merging both variables in order to visualize and plot
property_ratio <- merge (property_df, total_property, by="neighbourhood_cleansed")

property_ratio <- property_ratio %>% 
  mutate(ratio = Freq/sum)

# Plot listings present in each neighbourhood group
ggplot(property_ratio, aes(x=neighbourhood_cleansed, y = ratio, fill = room_type)) + geom_bar(position = "dodge", stat="identity") + 
  xlab("Neighbourhood Cleansed") + ylab ("Share of Listings") +
  scale_fill_discrete(name = "Room Type") +
  scale_y_continuous(labels = scales::percent) +
  coord_flip()

The graph above shows the share of each room type within each neighbourhood. ‘Shared room’ listings are rare in all neighbourhoods. Across the dataset as a whole, ‘Entire home/apt’ is by far the most common room type (6305 of 7833 listings), followed by ‘Private room’ (1482).
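
The per-neighbourhood ratio computed above with group_by() and merge() can also be sketched in base R with table() and prop.table(). The toy data below is hypothetical:

```r
# Toy listings (hypothetical values)
toy <- data.frame(
  neighbourhood = c("A", "A", "A", "B", "B"),
  room_type     = c("Entire home/apt", "Entire home/apt", "Private room",
                    "Entire home/apt", "Private room")
)

# Cross-tabulate neighbourhood by room type
counts <- table(toy$neighbourhood, toy$room_type)

# Divide each row by its total so the shares within a neighbourhood sum to 1
ratio <- prop.table(counts, margin = 1)
ratio
```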

Price comparison among each Neighbour Hood Group.

AirBnb %>% 
  group_by(neighbourhood_cleansed) %>% 
  summarise(mean_price = mean(price, na.rm = TRUE)) %>% 
  ggplot(aes(x = reorder(neighbourhood_cleansed, mean_price), y = mean_price, fill = neighbourhood_cleansed)) +
  geom_col(color = "black", fill="maroon") +
  coord_flip() +
  theme_gray() +
  labs(x = "Neighbourhood Group", y = "Price") +
  geom_text(aes(label = round(mean_price,digit = 2)), hjust = 2.0, color = "white", size = 3.5) +
  ggtitle("Mean Price comparison for each Neighbourhood Group", subtitle = "Price vs Neighbourhood Group") + 
  xlab("Neighbourhood Group") + 
  ylab("Mean Price") +
  theme(legend.position = "none",
        plot.title = element_text(color = "black", size = 14, face = "bold", hjust = 1),
        plot.subtitle = element_text(color = "black", hjust = 0.5),
        axis.title.y = element_text(),
        axis.title.x = element_text(),
        axis.ticks = element_blank())

Price analysis of room type

AirBnb %>% 
  filter(!(is.na(room_type))) %>% 
  filter(!(room_type == "Unknown")) %>% 
  group_by(room_type) %>% 
  summarise(mean_price = mean(price, na.rm = TRUE)) %>% 
  ggplot(aes(x = reorder(room_type, mean_price), y = mean_price, fill = room_type)) +
  geom_col(color = "black", fill="orange") +
  coord_flip() +
  theme_gray() +
  labs(x = "Room Type", y = "Price") +
  geom_text(aes(label = round(mean_price,digit = 2)), hjust = 2.0, color = "black", size = 3.5) +
  ggtitle("Mean Price comparison with all Room Types", subtitle = "Price vs Room Type") + 
  xlab("Room Type") + 
  ylab("Mean Price") +
  theme(legend.position = "none",
        plot.title = element_text(color = "black", size = 14, face = "bold", hjust = 0.5),
        plot.subtitle = element_text(color = "black", hjust = 0.5),
        axis.title.y = element_text(),
        axis.title.x = element_text(),
        axis.ticks = element_blank())

Correlation Matrix of Numeric Dimensions

A correlation plot is made to find relationships among the numeric features.

airbnb.corr <- AirBnb %>% 
  select(price, minimum_nights, accommodates, bathrooms, bedrooms, beds, guests_included, extra_people)

cor(airbnb.corr) # get the correlation matrix
##                      price minimum_nights accommodates  bathrooms   bedrooms
## price           1.00000000     0.01903058   0.34302041 0.22020643 0.34534540
## minimum_nights  0.01903058     1.00000000   0.01783162 0.03515667 0.08472229
## accommodates    0.34302041     0.01783162   1.00000000 0.44742126 0.70468281
## bathrooms       0.22020643     0.03515667   0.44742126 1.00000000 0.43230198
## bedrooms        0.34534540     0.08472229   0.70468281 0.43230198 1.00000000
## beds            0.31670780     0.04521712   0.82401499 0.46935595 0.70831706
## guests_included 0.23804450     0.03152692   0.51068034 0.23791043 0.43861282
## extra_people    0.11928948    -0.04788970   0.32452077 0.12239996 0.19291047
##                       beds guests_included extra_people
## price           0.31670780      0.23804450    0.1192895
## minimum_nights  0.04521712      0.03152692   -0.0478897
## accommodates    0.82401499      0.51068034    0.3245208
## bathrooms       0.46935595      0.23791043    0.1224000
## bedrooms        0.70831706      0.43861282    0.1929105
## beds            1.00000000      0.45365983    0.2282370
## guests_included 0.45365983      1.00000000    0.4400605
## extra_people    0.22823700      0.44006047    1.0000000
corrplot(cor(airbnb.corr), method = "number", type = "lower", bg = "grey") # put this in a nice table
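
By default cor() computes the Pearson correlation coefficient. As a minimal sketch of what each entry of the matrix represents, the coefficient can be computed by hand for two toy vectors (hypothetical values) and compared with cor():

```r
# Toy vectors (hypothetical values)
x <- c(2, 4, 4, 6, 2)
y <- c(59, 130, 95, 250, 80)

# Pearson correlation: co-deviation divided by the product of the spreads
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

# Compare with the built-in implementation
all.equal(r_manual, cor(x, y))
```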

Regression Models

We are going to implement two regression models. The first is simple linear regression and the second is multiple linear regression.

Simple Linear Regression

In the simple linear regression model, the response variable to be predicted is price and the predictor is accommodates. The question is: does the number of guests a listing accommodates affect its price?

Price vs Accommodates graph has been drawn to visualize the trend among the features.

ggplot(data = AirBnb, mapping = aes(x = accommodates, y = price)) +
  geom_jitter() # jitter instead of points, otherwise many dots get drawn over each other

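After visualizing the points, a regression line is drawn through them. The line is fitted by least squares, i.e. it minimizes the sum of squared residuals. Note that this plot uses the natural logarithm of price on the y-axis.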

ggplot(data = AirBnb, mapping = aes(x = accommodates, y = log(price))) +
  geom_jitter() + # jitter instead of points, otherwise many dots get drawn over each other
  stat_summary(fun = mean, colour="green", size = 4, geom="point", shape = 23, fill = "green") + # means
  stat_smooth(method = "lm", se=FALSE) # regression line
## `geom_smooth()` using formula 'y ~ x'

We create a linear model. The first argument is the formula, which takes the form dependent variable ~ independent variable(s). The second argument is the data to use.

linearmodel <- lm(price ~ accommodates, data = AirBnb) 

Plotting the linear model shows its diagnostic plots.

par(mfrow=c(2,2)) 
plot(linearmodel)

The summary of the linear model shows parameters such as the p-value, R-squared and adjusted R-squared.

summary(linearmodel) # ask for a summary of this linear model
## 
## Call:
## lm(formula = price ~ accommodates, data = AirBnb)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -416.0  -32.2  -11.1   22.9 8898.8 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   51.1792     2.7654   18.51   <2e-16 ***
## accommodates  24.9890     0.7733   32.32   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 120.3 on 7831 degrees of freedom
## Multiple R-squared:  0.1177, Adjusted R-squared:  0.1176 
## F-statistic:  1044 on 1 and 7831 DF,  p-value: < 2.2e-16
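
The coefficients reported above define the fitted line price = 51.1792 + 24.9890 × accommodates, so each additional guest is associated with roughly 25 more in price. The R-squared of 0.1177 means that accommodates alone explains about 12% of the variance in price. A minimal sketch that evaluates this line for given guest counts; the function name `predict_price` is hypothetical, and on the real model the same result would come from predict():

```r
# Coefficients copied from the summary output above
intercept <- 51.1792
slope     <- 24.9890

# Fitted price for a given number of guests
predict_price <- function(accommodates) intercept + slope * accommodates

# Fitted prices for listings that accommodate 2 and 4 guests
predict_price(c(2, 4))
```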

Multiple Linear Regression

Splitting Data into Training and Testing Data and Removing Outliers in Price

To remove outliers, extreme prices are dropped at both ends: only listings with a price strictly between the 10th and 90th percentiles are kept.

AirBnb_filtered_data <- AirBnb %>%
  filter(price < quantile(AirBnb$price, 0.9) & price > quantile(AirBnb$price, 0.1))
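
To illustrate what this filter does, here is a minimal base R sketch on a toy price vector (hypothetical values). It uses the default quantile() method, as the code above does:

```r
# Toy prices (hypothetical values) with one extreme value at each end
price <- c(15, 59, 80, 95, 100, 115, 130, 140, 250, 9000)

# 10th and 90th percentiles
lower <- quantile(price, 0.1)
upper <- quantile(price, 0.9)

# Keep only prices strictly between the two bounds
trimmed <- price[price > lower & price < upper]
trimmed
```

Because the comparison is strict, listings priced exactly at either percentile are dropped as well.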

The training data is stored in “training_data” and the testing data in “testing_data”. The data is split in the ratio 50:50.

set.seed(12345)
AirBnb_filtered_data <- AirBnb_filtered_data %>% mutate(id = row_number())
training_data <- AirBnb_filtered_data %>% sample_frac(.5) %>% filter(price > 0)
testing_data <- anti_join(AirBnb_filtered_data, training_data, by = 'id') %>% filter(price > 0)

Next we check that the split was done correctly. The rows of the training and testing data together should equal the rows of the filtered data. This is a sanity check.

nrow(training_data) + nrow(testing_data) == nrow(AirBnb_filtered_data %>% filter(price > 0))
## [1] TRUE
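
The same split-and-check idea can be sketched in base R without dplyr. The data frame toy below is a made-up stand-in for AirBnb_filtered_data, and the variable names are illustrative.

```r
set.seed(12345)

# Toy data frame standing in for AirBnb_filtered_data
toy <- data.frame(id = 1:20, price = seq(50, 240, by = 10))

# Draw half of the row indices for training; the rest form the test set
train_rows <- sample(nrow(toy), size = nrow(toy) / 2)
toy_train  <- toy[train_rows, ]
toy_test   <- toy[-train_rows, ]

# The two parts must be disjoint and together cover every row
nrow(toy_train) + nrow(toy_test) == nrow(toy)
length(intersect(toy_train$id, toy_test$id)) == 0
```

Checking that the two id sets do not overlap is a slightly stronger test than comparing row counts alone.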

A variable selection method is used to choose appropriate predictors for the model. Here the best subset regression method (regsubsets) is used.

best_fit_model <- regsubsets(price ~ neighbourhood_cleansed + minimum_nights + accommodates +
                               bathrooms + bedrooms + beds + guests_included + extra_people +
                               property_type + room_type + number_of_reviews,
                             data = training_data, nbest = 2, nvmax = 11)

summary(best_fit_model)
plot(best_fit_model, scale="bic")

According to the output of the variable selection method, we consider neighbourhood_cleansed, minimum_nights, property_type, accommodates, beds, bedrooms, bathrooms, extra_people and number_of_reviews.
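
The plot above ranks candidate subsets by BIC, where a lower value indicates a better trade-off between fit and model size. As a minimal sketch of that criterion, the code below compares three nested models with base R's BIC(). The built-in mtcars data and the candidate formulas are stand-ins chosen for illustration; they are not part of this analysis.

```r
# Built-in mtcars data as a stand-in; the candidate formulas are illustrative
candidates <- list(
  wt_only    = mpg ~ wt,
  wt_hp      = mpg ~ wt + hp,
  wt_hp_qsec = mpg ~ wt + hp + qsec
)

# Fit each candidate and collect its BIC (lower is better)
bic_values <- sapply(candidates, function(f) BIC(lm(f, data = mtcars)))

bic_values
names(which.min(bic_values))
```

BIC penalises every extra coefficient, so an added variable is only worthwhile when it reduces the residual error enough to offset the penalty.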

Linear Model Training with Training Data Set

Now a model is created with the best variables selected by the method.

Linear_Model <- lm(price ~ neighbourhood_cleansed + minimum_nights + accommodates + bathrooms +
                     bedrooms + beds + extra_people + room_type + number_of_reviews,
                   data = training_data)

Linear_Model_Summary <- summary(Linear_Model)
Linear_Model_MSE <- Linear_Model_Summary$sigma^2
Linear_Model_RSQ <- Linear_Model_Summary$r.squared
Linear_Model_ARSQ <- Linear_Model_Summary$adj.r.squared

Linear_Model_Summary
## 
## Call:
## lm(formula = price ~ neighbourhood_cleansed + minimum_nights + 
##     accommodates + bathrooms + bedrooms + beds + extra_people + 
##     room_type + number_of_reviews, data = training_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -201.014  -19.030   -4.214   16.249   90.056 
## 
## Coefficients:
##                                                               Estimate
## (Intercept)                                                   74.59828
## neighbourhood_cleansedBijlmer-Oost                            -5.17820
## neighbourhood_cleansedBos en Lommer                           -3.54984
## neighbourhood_cleansedBuitenveldert - Zuidas                   0.76130
## neighbourhood_cleansedCentrum-Oost                            22.95496
## neighbourhood_cleansedCentrum-West                            29.69401
## neighbourhood_cleansedDe Aker - Nieuw Sloten                  -3.39186
## neighbourhood_cleansedDe Baarsjes - Oud-West                   9.98905
## neighbourhood_cleansedDe Pijp - Rivierenbuurt                 10.41505
## neighbourhood_cleansedGaasperdam - Driemond                    3.66337
## neighbourhood_cleansedGeuzenveld - Slotermeer                 -6.84094
## neighbourhood_cleansedIJburg - Zeeburgereiland                 2.46950
## neighbourhood_cleansedNoord-Oost                             -10.80096
## neighbourhood_cleansedNoord-West                               3.88900
## neighbourhood_cleansedOostelijk Havengebied - Indische Buurt  -3.17246
## neighbourhood_cleansedOsdorp                                  -5.97466
## neighbourhood_cleansedOud-Noord                                2.69579
## neighbourhood_cleansedOud-Oost                                 3.16678
## neighbourhood_cleansedSlotervaart                              1.71980
## neighbourhood_cleansedWatergraafsmeer                          7.63674
## neighbourhood_cleansedWesterpark                               7.19021
## neighbourhood_cleansedZuid                                    10.12776
## minimum_nights                                                -0.60984
## accommodates                                                   3.73188
## bathrooms                                                      3.54347
## bedrooms                                                      12.69799
## beds                                                           0.87705
## extra_people                                                   0.01790
## room_typePrivate room                                        -17.23587
## room_typeShared room                                         -26.53055
## number_of_reviews                                             -0.14796
##                                                              Std. Error t value
## (Intercept)                                                     9.75751   7.645
## neighbourhood_cleansedBijlmer-Oost                             21.37608  -0.242
## neighbourhood_cleansedBos en Lommer                             9.88859  -0.359
## neighbourhood_cleansedBuitenveldert - Zuidas                   10.81780   0.070
## neighbourhood_cleansedCentrum-Oost                              9.72979   2.359
## neighbourhood_cleansedCentrum-West                              9.69183   3.064
## neighbourhood_cleansedDe Aker - Nieuw Sloten                   12.14670  -0.279
## neighbourhood_cleansedDe Baarsjes - Oud-West                    9.69919   1.030
## neighbourhood_cleansedDe Pijp - Rivierenbuurt                   9.72624   1.071
## neighbourhood_cleansedGaasperdam - Driemond                    21.35263   0.172
## neighbourhood_cleansedGeuzenveld - Slotermeer                  13.15579  -0.520
## neighbourhood_cleansedIJburg - Zeeburgereiland                 10.88402   0.227
## neighbourhood_cleansedNoord-Oost                               14.00540  -0.771
## neighbourhood_cleansedNoord-West                               11.63768   0.334
## neighbourhood_cleansedOostelijk Havengebied - Indische Buurt    9.90415  -0.320
## neighbourhood_cleansedOsdorp                                   13.18808  -0.453
## neighbourhood_cleansedOud-Noord                                10.12277   0.266
## neighbourhood_cleansedOud-Oost                                  9.84387   0.322
## neighbourhood_cleansedSlotervaart                              10.40983   0.165
## neighbourhood_cleansedWatergraafsmeer                          10.20604   0.748
## neighbourhood_cleansedWesterpark                                9.76404   0.736
## neighbourhood_cleansedZuid                                      9.81053   1.032
## minimum_nights                                                  0.28263  -2.158
## accommodates                                                    0.67443   5.533
## bathrooms                                                       1.68401   2.104
## bedrooms                                                        0.98920  12.837
## beds                                                            0.71194   1.232
## extra_people                                                    0.02968   0.603
## room_typePrivate room                                           1.50652 -11.441
## room_typeShared room                                            9.71258  -2.732
## number_of_reviews                                               0.01993  -7.425
##                                                              Pr(>|t|)    
## (Intercept)                                                  2.80e-14 ***
## neighbourhood_cleansedBijlmer-Oost                            0.80861    
## neighbourhood_cleansedBos en Lommer                           0.71963    
## neighbourhood_cleansedBuitenveldert - Zuidas                  0.94390    
## neighbourhood_cleansedCentrum-Oost                            0.01838 *  
## neighbourhood_cleansedCentrum-West                            0.00220 ** 
## neighbourhood_cleansedDe Aker - Nieuw Sloten                  0.78008    
## neighbourhood_cleansedDe Baarsjes - Oud-West                  0.30315    
## neighbourhood_cleansedDe Pijp - Rivierenbuurt                 0.28434    
## neighbourhood_cleansedGaasperdam - Driemond                   0.86379    
## neighbourhood_cleansedGeuzenveld - Slotermeer                 0.60311    
## neighbourhood_cleansedIJburg - Zeeburgereiland                0.82052    
## neighbourhood_cleansedNoord-Oost                              0.44065    
## neighbourhood_cleansedNoord-West                              0.73827    
## neighbourhood_cleansedOostelijk Havengebied - Indische Buurt  0.74875    
## neighbourhood_cleansedOsdorp                                  0.65056    
## neighbourhood_cleansedOud-Noord                               0.79002    
## neighbourhood_cleansedOud-Oost                                0.74770    
## neighbourhood_cleansedSlotervaart                             0.86879    
## neighbourhood_cleansedWatergraafsmeer                         0.45436    
## neighbourhood_cleansedWesterpark                              0.46155    
## neighbourhood_cleansedZuid                                    0.30200    
## minimum_nights                                                0.03103 *  
## accommodates                                                 3.41e-08 ***
## bathrooms                                                     0.03545 *  
## bedrooms                                                      < 2e-16 ***
## beds                                                          0.21808    
## extra_people                                                  0.54645    
## room_typePrivate room                                         < 2e-16 ***
## room_typeShared room                                          0.00634 ** 
## number_of_reviews                                            1.46e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 26.93 on 2973 degrees of freedom
## Multiple R-squared:  0.3016, Adjusted R-squared:  0.2946 
## F-statistic:  42.8 on 30 and 2973 DF,  p-value: < 2.2e-16

The MSE, R-squared and adjusted R-squared of the model are, respectively:

Linear_Model_MSE
## [1] 725.4168
Linear_Model_RSQ
## [1] 0.3016399
Linear_Model_ARSQ
## [1] 0.2945928
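
Note that the value stored in Linear_Model_MSE is summary()$sigma^2, i.e. the residual sum of squares divided by the residual degrees of freedom, which is slightly larger than the plain mean of the squared residuals. The sketch below shows the relationship on a stand-in model fitted to the built-in mtcars data (an assumption made only so the example runs on its own).

```r
# Stand-in model on built-in data; not the AirBnb model
fit <- lm(mpg ~ wt + hp, data = mtcars)

rss           <- sum(residuals(fit)^2)      # residual sum of squares
sigma_sq      <- summary(fit)$sigma^2       # what the report stores as "MSE"
rss_over_df   <- rss / df.residual(fit)     # RSS / (n - number of coefficients)
mean_sq_resid <- mean(residuals(fit)^2)     # RSS / n

c(sigma_sq = sigma_sq, rss_over_df = rss_over_df, mean_sq_resid = mean_sq_resid)
```

With 2973 residual degrees of freedom in the model above, the difference between the two definitions is small, but it matters when comparing against a test MSE computed with mean().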

Plot Linear Regression Model

par(mfrow=c(2,2)) 
plot(Linear_Model)

The residuals vs fitted values plot shows that the points are not evenly distributed around zero and do not have constant variance across the fitted values. This means that the linearity and equal variance assumptions are not satisfied.

The QQ plot follows the 45 degree line, meaning that the normality assumption is met.

Testing our Linear Model with Testing Data set

Linear_Model_Test <- predict(object = Linear_Model, newdata = testing_data)

Now calculating MSE for Test Data

mean((Linear_Model_Test - testing_data$price)^2)
## [1] 739.5444

Calculating the MSPE (mean squared prediction error) for the filtered data set with 3-fold cross-validation

Linear_Model_FD <- glm(price ~ neighbourhood_cleansed + minimum_nights + accommodates + bathrooms +
                         bedrooms + beds + extra_people + room_type + number_of_reviews,
                       data = AirBnb_filtered_data)

cv.glm(data= AirBnb_filtered_data, glmfit = Linear_Model_FD, K = 3)$delta[2]
## [1] 734.8276

The cross-validated MSE on the filtered data is about 735 and the MSE on the test data is about 739.5, so the two are very close. This suggests that the model with the selected variables performs consistently on unseen data and shows no sign of overfitting.
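
To show what K-fold cross-validation computes, here is a minimal hand-written 3-fold version in base R. It uses the built-in mtcars data and an illustrative formula as stand-ins, and it returns the raw (unadjusted) estimate, which corresponds to delta[1] of cv.glm rather than the bias-adjusted delta[2] used above.

```r
set.seed(1)
k <- 3
d <- mtcars   # stand-in data

# Randomly assign each row to one of k folds
fold <- sample(rep(1:k, length.out = nrow(d)))

# For each fold: fit on the other folds, predict the held-out fold, record its MSE
fold_mse <- sapply(1:k, function(i) {
  fit  <- lm(mpg ~ wt + hp, data = d[fold != i, ])
  pred <- predict(fit, newdata = d[fold == i, ])
  mean((d$mpg[fold == i] - pred)^2)
})

# Combine the fold errors, weighting each fold by its share of the rows
fold_weight <- as.numeric(table(fold)) / nrow(d)
cv_mse <- sum(fold_mse * fold_weight)
cv_mse
```

Every observation is predicted exactly once by a model that did not see it, which is why the result is comparable to a test-set MSE.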

Classification

Based on the property characteristics and various other parameters, is the price of a particular property high?

Decision Tree

Loading Tree package

require(tree)
## Loading required package: tree
## Warning: package 'tree' was built under R version 4.1.2
## Registered S3 method overwritten by 'tree':
##   method     from
##   print.tree cli

Classification algorithms need a discrete target variable. Our target is price, so we create a new variable, price_cat, that splits price into two categories, “Cheap” and “Expensive”, using the mean price as the threshold: a price greater than the mean is labelled Expensive, and a price less than or equal to the mean is labelled Cheap.

The new feature price_cat is attached to the original data and converted to a factor for further processing.

AirBnb_filtered_data_cat <- AirBnb %>%
  mutate(price_cat = ifelse(price <= mean(price),"Cheap","Expensive"))

AirBnb_filtered_data_cat = data.frame(AirBnb_filtered_data_cat, AirBnb_filtered_data_cat$price_cat)

AirBnb_filtered_data_cat$price_cat = as.factor(AirBnb_filtered_data_cat$price_cat)
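
The labelling rule can be tried on a few made-up prices. The vector toy_price below is illustrative only.

```r
# Five made-up prices, one of them much higher than the rest
toy_price <- c(60, 80, 100, 120, 400)

# The mean is the threshold, as in the code above
threshold <- mean(toy_price)

# Prices at or below the mean are Cheap, the rest Expensive
price_cat <- factor(ifelse(toy_price <= threshold, "Cheap", "Expensive"))

threshold
table(price_cat)
```

Because a few very high prices pull the mean upwards, a mean threshold tends to put most listings in the Cheap class; in the tree output further down, about 66% of the listings are Cheap.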

We drop the variables that are not needed, including price: it cannot be used as a predictor because our response variable price_cat is derived from it.

Afterwards, we will fit our model using AirBnb_filtered_data_cat, by setting the target variable i.e. price_cat.

AirBnb_filtered_data_cat = select(AirBnb_filtered_data_cat, -c(price,host_id,host_name,host_since_year,host_since_anniversary,id,city,country,state,zipcode))

tree.AirBnb_filtered_data_cat = tree(price_cat~., data = AirBnb_filtered_data_cat)

In the summary we can see the number of terminal nodes, the residual mean deviance and the misclassification error rate.

summary(tree.AirBnb_filtered_data_cat)
## 
## Classification tree:
## tree(formula = price_cat ~ ., data = AirBnb_filtered_data_cat)
## Variables actually used in tree construction:
## [1] "accommodates"           "neighbourhood_cleansed" "room_type"             
## [4] "bedrooms"               "guests_included"        "extra_people"          
## [7] "review_scores_location"
## Number of terminal nodes:  8 
## Residual mean deviance:  0.9146 = 7157 / 7825 
## Misclassification error rate: 0.2054 = 1609 / 7833
  • Residual mean deviance = 0.9146
  • Misclassification error rate = 0.2054
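
Both figures can be reproduced from the counts in the summary output. The sketch below only repeats that arithmetic; the variable names are chosen for illustration.

```r
# Counts copied from the summary output above
n_obs           <- 7833   # observations
n_terminal      <- 8      # terminal nodes
deviance_total  <- 7157   # total residual deviance
n_misclassified <- 1609   # misclassified observations

# Residual mean deviance divides by (observations - terminal nodes)
residual_mean_deviance <- deviance_total / (n_obs - n_terminal)

# Misclassification rate divides by the number of observations
misclass_rate <- n_misclassified / n_obs

round(c(residual_mean_deviance, misclass_rate), 4)
```

The denominator 7825 in the deviance line is therefore 7833 observations minus the 8 terminal nodes.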

Now plot the tree for better visuals

plot(tree.AirBnb_filtered_data_cat)
text(tree.AirBnb_filtered_data_cat, pretty = 0)

tree.AirBnb_filtered_data_cat
## node), split, n, deviance, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 7833 10070.0 Cheap ( 0.65760 0.34240 )  
##    2) accommodates < 3.5 4929  4597.0 Cheap ( 0.82329 0.17671 )  
##      4) neighbourhood_cleansed: Bijlmer-Centrum,Bijlmer-Oost,Bos en Lommer,Buitenveldert - Zuidas,De Aker - Nieuw Sloten,De Baarsjes - Oud-West,De Pijp - Rivierenbuurt,Gaasperdam - Driemond,Geuzenveld - Slotermeer,IJburg - Zeeburgereiland,Noord-Oost,Noord-West,Oostelijk Havengebied - Indische Buurt,Osdorp,Oud-Noord,Oud-Oost,Slotervaart,Watergraafsmeer,Westerpark,Zuid 3563  2457.0 Cheap ( 0.89082 0.10918 )  
##        8) room_type: Private room,Shared room 1073   166.3 Cheap ( 0.98509 0.01491 ) *
##        9) room_type: Entire home/apt 2490  2103.0 Cheap ( 0.85020 0.14980 ) *
##      5) neighbourhood_cleansed: Centrum-Oost,Centrum-West 1366  1774.0 Cheap ( 0.64714 0.35286 ) *
##    3) accommodates > 3.5 2904  3846.0 Expensive ( 0.37638 0.62362 )  
##      6) bedrooms < 1.20744 784  1028.0 Cheap ( 0.63648 0.36352 ) *
##      7) bedrooms > 1.20744 2120  2515.0 Expensive ( 0.28019 0.71981 )  
##       14) guests_included < 3.5 1517  1964.0 Expensive ( 0.35003 0.64997 )  
##         28) extra_people < 2.5 697   649.6 Expensive ( 0.17647 0.82353 ) *
##         29) extra_people > 2.5 820  1137.0 Expensive ( 0.49756 0.50244 )  
##           58) review_scores_location < 9.14647 347   422.1 Cheap ( 0.70317 0.29683 ) *
##           59) review_scores_location > 9.14647 473   610.5 Expensive ( 0.34672 0.65328 ) *
##       15) guests_included > 3.5 603   403.8 Expensive ( 0.10448 0.89552 ) *

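Each line of the printed tree gives the node number, the split rule with its threshold, the number of observations, the deviance, the predicted class, and the class probabilities (Cheap, Expensive); an asterisk marks a terminal node.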
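
To show how the printed splits turn into a prediction, the sketch below hand-codes the tree as nested conditions. The function name classify_listing is hypothetical. Since every terminal node under accommodates < 3.5 predicts Cheap, the neighbourhood and room type splits do not change the predicted class and are left out.

```r
# Hypothetical helper that follows the splits printed above
classify_listing <- function(accommodates, bedrooms, guests_included,
                             extra_people, review_scores_location) {
  # Nodes 8, 9 and 5: all terminal nodes on this side predict Cheap
  if (accommodates < 3.5) return("Cheap")
  # Node 6
  if (bedrooms < 1.20744) return("Cheap")
  # Node 15
  if (guests_included > 3.5) return("Expensive")
  # Node 28
  if (extra_people < 2.5) return("Expensive")
  # Nodes 58 and 59
  if (review_scores_location < 9.14647) "Cheap" else "Expensive"
}

# A listing for 4 guests with 2 bedrooms, 2 included guests and no extra-guest fee
classify_listing(accommodates = 4, bedrooms = 2, guests_included = 2,
                 extra_people = 0, review_scores_location = 9)
```

On the fitted object itself, predict(..., type = "class") performs the same traversal.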

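Now we split the data into training and test sets: 5000 of the 7833 rows (about 64%) are sampled for training and the remaining 2833 rows are held out for testing. We then refit the tree, this time using only the training rows.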

set.seed(100)

train = sample(1:nrow(AirBnb_filtered_data_cat), 5000)

tree.AirBnb = tree(price_cat~., AirBnb_filtered_data_cat, subset = train)

Plot the tree model fitted with training dataset.

plot(tree.AirBnb)
text(tree.AirBnb, pretty = 0)

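The next step is prediction on the held-out rows, followed by a misclassification table to evaluate the error. Note that the call below predicts with tree.AirBnb_filtered_data_cat, the tree fitted on all rows, rather than with tree.AirBnb, so the held-out rows were also used for fitting and the resulting accuracy is optimistic.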

tree.pred = predict(tree.AirBnb_filtered_data_cat, AirBnb_filtered_data_cat[-train,], type="class")

with(AirBnb_filtered_data_cat[-train,], table(tree.pred, price_cat))
##            price_cat
## tree.pred   Cheap Expensive
##   Cheap      1726       466
##   Expensive   133       508

The diagonal holds the correct classifications; the off-diagonal cells are the misclassifications.

(1726 + 508)/2833
## [1] 0.7885634

The tree classifies about 78.9% of the held-out listings correctly, i.e. the misclassification error is about 0.21.
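
Accuracy hides how the errors are distributed between the two classes. The sketch below rebuilds the table above as a matrix and derives a few rates from it; the variable names are illustrative.

```r
# Confusion table copied from the output above (rows: predicted, columns: actual)
cm <- matrix(c(1726, 133, 466, 508), nrow = 2,
             dimnames = list(predicted = c("Cheap", "Expensive"),
                             actual    = c("Cheap", "Expensive")))

accuracy   <- sum(diag(cm)) / sum(cm)   # share of correct predictions
error_rate <- 1 - accuracy              # share of wrong predictions

# Share of truly Expensive listings that the tree labels Expensive
recall_expensive <- cm["Expensive", "Expensive"] / sum(cm[, "Expensive"])

round(c(accuracy = accuracy, error_rate = error_rate,
        recall_expensive = recall_expensive), 4)
```

Only about half of the Expensive listings are recognised as such, so most of the error comes from expensive listings being labelled Cheap.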

A large, bushy tree may have too much variance. So let’s use cross-validation to prune the tree, using cv.tree with the misclassification error rate as the pruning criterion.

cv.AirBnb_filtered_data_cat = cv.tree(tree.AirBnb_filtered_data_cat, FUN = prune.misclass)

cv.AirBnb_filtered_data_cat
## $size
## [1] 8 6 3 2 1
## 
## $dev
## [1] 1728 1728 1777 1981 2616
## 
## $k
## [1] -Inf    0   47  214  718
## 
## $method
## [1] "misclass"
## 
## attr(,"class")
## [1] "prune"         "tree.sequence"
plot(cv.AirBnb_filtered_data_cat)

The plot shows the cross-validated number of misclassifications falling as the tree grows: from 2616 for a single-node tree down to 1728 for trees with 6 or 8 terminal nodes. We choose size 8, which has the lowest error, and prune the tree to that size. Let’s plot and annotate the pruned tree to see how it turns out.
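
The choice of size can also be read directly from the cv.tree output. The sketch below copies the $size and $dev vectors printed above and picks the sizes with the fewest cross-validated misclassifications. Dividing by 7833 assumes the cross-validation ran over all rows, since the tree passed to cv.tree was fitted on the full data.

```r
# Values copied from the cv.tree output above
cv_size <- c(8, 6, 3, 2, 1)
cv_dev  <- c(1728, 1728, 1777, 1981, 2616)

# Sizes that achieve the minimum number of misclassifications
best_sizes <- cv_size[cv_dev == min(cv_dev)]

# Among ties, the smallest tree is the simplest choice
smallest_best <- min(best_sizes)

# Cross-validated error rate, assuming all 7833 rows were used
cv_error_rate <- min(cv_dev) / 7833

best_sizes
smallest_best
round(cv_error_rate, 4)
```

Sizes 8 and 6 are tied, so a tree with 6 terminal nodes would be a simpler candidate with the same cross-validated error.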

prune.AirBnb_filtered_data_cat = prune.misclass(tree.AirBnb_filtered_data_cat, best = 8)
plot(prune.AirBnb_filtered_data_cat)
text(prune.AirBnb_filtered_data_cat, pretty=0)

Because the fitted tree already has 8 terminal nodes, pruning to size 8 leaves it unchanged. Let’s evaluate it on the test dataset again.

tree.pred = predict(prune.AirBnb_filtered_data_cat, AirBnb_filtered_data_cat[-train,], type="class")

with(AirBnb_filtered_data_cat[-train,], table(tree.pred, price_cat))
##            price_cat
## tree.pred   Cheap Expensive
##   Cheap      1726       466
##   Expensive   133       508

The confusion matrix is identical to that of the original tree, so pruning to this size did not change the misclassification errors.

Naive Bayes

Splitting data into train and test data

split <- sample.split(AirBnb_filtered_data_cat, SplitRatio = 0.7)
train_cl <- subset(AirBnb_filtered_data_cat, split == "TRUE")
test_cl <- subset(AirBnb_filtered_data_cat, split == "FALSE")
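
Two details of this split are worth checking. First, caTools::sample.split is documented as taking a vector of labels, so passing the whole data frame appears to produce one flag per column, which subset() then recycles across the rows. Second, set.seed() is only called after the split, so the split itself is not reproducible. As a hedged alternative, the sketch below does a 70:30 split stratified by class in base R; the toy data frame and all names are made up for the example.

```r
set.seed(12345)   # set the seed before sampling so the split is reproducible

# Toy data: 70 Cheap and 30 Expensive rows (illustrative only)
toy <- data.frame(x = 1:100,
                  price_cat = factor(rep(c("Cheap", "Expensive"), times = c(70, 30))))

# Within each class, sample 70% of the row indices for training
rows_by_class <- split(seq_len(nrow(toy)), toy$price_cat)
train_idx <- unname(unlist(lapply(rows_by_class, function(idx) {
  idx[sample.int(length(idx), size = round(0.7 * length(idx)))]
})))

toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

table(toy_train$price_cat)
table(toy_test$price_cat)
```

Stratifying keeps the Cheap to Expensive proportion the same in both parts, which makes the test accuracy easier to compare with the class prior.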

Fitting Naive Bayes Model to training dataset

set.seed(12345) # Setting Seed
classifier_cl <- naiveBayes(price_cat ~ ., data = train_cl)
classifier_cl
## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
##     Cheap Expensive 
## 0.6594495 0.3405505 
## 
## Conditional probabilities:
##            neighbourhood_cleansed
## Y           Bijlmer-Centrum Bijlmer-Oost Bos en Lommer Buitenveldert - Zuidas
##   Cheap         0.004173623  0.003617140   0.057039510            0.011964385
##   Expensive     0.000000000  0.000000000   0.017241379            0.004849138
##            neighbourhood_cleansed
## Y           Centrum-Oost Centrum-West De Aker - Nieuw Sloten
##   Cheap      0.089593767  0.119922092            0.006121313
##   Expensive  0.155711207  0.305495690            0.004849138
##            neighbourhood_cleansed
## Y           De Baarsjes - Oud-West De Pijp - Rivierenbuurt
##   Cheap                0.169170840             0.124930440
##   Expensive            0.113685345             0.108297414
##            neighbourhood_cleansed
## Y           Gaasperdam - Driemond Geuzenveld - Slotermeer
##   Cheap               0.001669449             0.010851419
##   Expensive           0.000000000             0.001616379
##            neighbourhood_cleansed
## Y           IJburg - Zeeburgereiland  Noord-Oost  Noord-West
##   Cheap                  0.011407902 0.007234279 0.011129661
##   Expensive              0.016702586 0.003232759 0.005926724
##            neighbourhood_cleansed
## Y           Oostelijk Havengebied - Indische Buurt      Osdorp   Oud-Noord
##   Cheap                                0.050639955 0.005008347 0.031441291
##   Expensive                            0.028556034 0.004310345 0.019935345
##            neighbourhood_cleansed
## Y              Oud-Oost Slotervaart Watergraafsmeer  Westerpark        Zuid
##   Cheap     0.065108514 0.024485253     0.023094046 0.097941013 0.073455760
##   Expensive 0.030711207 0.009159483     0.019935345 0.075431034 0.074353448
## 
##            property_type
## Y              Apartment Bed & Breakfast         Boat        Cabin    Camper/RV
##   Cheap     0.8336115748    0.0614913745 0.0217028381 0.0016694491 0.0022259321
##   Expensive 0.7532327586    0.0226293103 0.0759698276 0.0005387931 0.0000000000
##            property_type
## Y                 Chalet         Dorm  Earth House        House          Hut
##   Cheap     0.0000000000 0.0002782415 0.0002782415 0.0653867557 0.0000000000
##   Expensive 0.0000000000 0.0000000000 0.0000000000 0.1325431034 0.0000000000
##            property_type
## Y                   Loft        Other    Treehouse        Villa         Yurt
##   Cheap     0.0080690039 0.0041736227 0.0002782415 0.0002782415 0.0005564830
##   Expensive 0.0102370690 0.0026939655 0.0000000000 0.0021551724 0.0000000000
## 
##            room_type
## Y           Entire home/apt Private room Shared room
##   Cheap         0.720367279  0.273233166 0.006399555
##   Expensive     0.958512931  0.039870690 0.001616379
## 
##            accommodates
## Y               [,1]     [,2]
##   Cheap     2.578186 1.220081
##   Expensive 4.112608 2.113168
## 
##            bathrooms
## Y               [,1]      [,2]
##   Cheap     1.052898 0.3134883
##   Expensive 1.223556 0.4617200
## 
##            bedrooms
## Y               [,1]      [,2]
##   Cheap     1.138422 0.5289212
##   Expensive 1.931075 1.1183099
## 
##            beds
## Y               [,1]     [,2]
##   Cheap     1.537253 1.080841
##   Expensive 2.814620 2.125744
## 
##            bed_type
## Y                 Airbed        Couch        Futon Pull-out Sofa     Real Bed
##   Cheap     0.0022259321 0.0016694491 0.0044518642  0.0155815248 0.9760712298
##   Expensive 0.0005387931 0.0000000000 0.0010775862  0.0032327586 0.9951508621
## 
##            guests_included
## Y               [,1]      [,2]
##   Cheap     1.405398 0.7598203
##   Expensive 2.123922 1.5491060
## 
##            extra_people
## Y               [,1]     [,2]
##   Cheap     11.25042 15.88109
##   Expensive 18.11530 22.53947
## 
##            minimum_nights
## Y               [,1]     [,2]
##   Cheap     2.429883 1.935851
##   Expensive 2.617996 1.719424
## 
##            host_response_time
## Y               [,1]      [,2]
##   Cheap     3.732888 1.0553263
##   Expensive 3.808190 0.9734246
## 
##            host_response_rate
## Y               [,1]     [,2]
##   Cheap     76.45659 15.10461
##   Expensive 77.40733 13.56613
## 
##            number_of_reviews
## Y               [,1]     [,2]
##   Cheap     15.63077 27.93547
##   Expensive 10.10938 17.85257
## 
##            review_scores_rating
## Y               [,1]     [,2]
##   Cheap     92.95944 6.715819
##   Expensive 93.88201 7.003579
## 
##            review_scores_accuracy
## Y               [,1]      [,2]
##   Cheap     9.423933 0.7194781
##   Expensive 9.472038 0.7551881
## 
##            review_scores_cleanliness
## Y               [,1]      [,2]
##   Cheap     9.257155 0.8861725
##   Expensive 9.322351 0.8559437
## 
##            review_scores_checkin
## Y               [,1]      [,2]
##   Cheap     9.634670 0.6332843
##   Expensive 9.644268 0.7129074
## 
##            review_scores_communication
## Y               [,1]      [,2]
##   Cheap     9.692464 0.5823568
##   Expensive 9.710545 0.5672628
## 
##            review_scores_location
## Y               [,1]      [,2]
##   Cheap     9.206621 0.7844875
##   Expensive 9.438555 0.6902923
## 
##            review_scores_value
## Y               [,1]      [,2]
##   Cheap     9.025899 0.7930470
##   Expensive 9.056944 0.7862621
## 
##            AirBnb_filtered_data_cat.price_cat
## Y           Cheap Expensive
##   Cheap         1         0
##   Expensive     0         1

Predicting on test data

y_pred <- predict(classifier_cl, newdata = test_cl)

Confusion Matrix

cm <- table(test_cl$price_cat, y_pred)
cm
##            y_pred
##             Cheap Expensive
##   Cheap      1518        39
##   Expensive    11       815

Model Evaluation

confusionMatrix(cm)
## Confusion Matrix and Statistics
## 
##            y_pred
##             Cheap Expensive
##   Cheap      1518        39
##   Expensive    11       815
##                                           
##                Accuracy : 0.979           
##                  95% CI : (0.9724, 0.9844)
##     No Information Rate : 0.6416          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.954           
##                                           
##  Mcnemar's Test P-Value : 0.0001343       
##                                           
##             Sensitivity : 0.9928          
##             Specificity : 0.9543          
##          Pos Pred Value : 0.9750          
##          Neg Pred Value : 0.9867          
##              Prevalence : 0.6416          
##          Detection Rate : 0.6370          
##    Detection Prevalence : 0.6534          
##       Balanced Accuracy : 0.9736          
##                                           
##        'Positive' Class : Cheap           
## 
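
The accuracy of 0.979 should be read with care. The conditional probabilities printed above include a predictor named AirBnb_filtered_data_cat.price_cat whose probabilities are exactly 0 and 1. This column is the copy of the response that the earlier data.frame() call appended, so the classifier can read the label directly. The sketch below reproduces the effect on a toy data frame and flags any predictor that is identical to the response; all names in it are illustrative.

```r
# Toy data with a response column (illustrative only)
toy <- data.frame(accommodates = c(2, 4, 3, 6),
                  price_cat = factor(c("Cheap", "Expensive", "Cheap", "Expensive")))

# data.frame(df, df$col) appends a second copy of the column under a new name
toy <- data.frame(toy, toy$price_cat)
names(toy)

# Flag predictors whose values are identical to the response
predictors <- setdiff(names(toy), "price_cat")
is_copy <- sapply(predictors, function(v) {
  all(as.character(toy[[v]]) == as.character(toy$price_cat))
})
leaky <- predictors[is_copy]

# Drop the leaking columns before fitting a classifier
toy_clean <- toy[, setdiff(names(toy), leaky)]
names(toy_clean)
```

Refitting the Naive Bayes model without the duplicated column would give a more honest estimate of its accuracy.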

Clustering

Principal Component Analysis

airbnb.pca <- prcomp(AirBnb[,c(13:16,18:31)], center = TRUE, scale. = TRUE)
summary(airbnb.pca)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     1.9555 1.8606 1.25032 1.05551 0.98530 0.96323 0.92902
## Proportion of Variance 0.2124 0.1923 0.08685 0.06189 0.05393 0.05154 0.04795
## Cumulative Proportion  0.2124 0.4048 0.49163 0.55352 0.60746 0.65900 0.70695
##                            PC8     PC9    PC10   PC11    PC12    PC13   PC14
## Standard deviation     0.89733 0.84511 0.79855 0.7348 0.70240 0.68436 0.6641
## Proportion of Variance 0.04473 0.03968 0.03543 0.0300 0.02741 0.02602 0.0245
## Cumulative Proportion  0.75169 0.79136 0.82679 0.8568 0.88420 0.91022 0.9347
##                           PC15    PC16    PC17    PC18
## Standard deviation     0.62137 0.56134 0.55417 0.40825
## Proportion of Variance 0.02145 0.01751 0.01706 0.00926
## Cumulative Proportion  0.95617 0.97368 0.99074 1.00000
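
The rows of this table are linked by simple arithmetic. Because the 18 input variables are scaled to unit variance, the total variance is 18, and each component's proportion is its squared standard deviation divided by 18. The sketch below checks this for the first two components and counts how many components are needed to reach 80% of the variance; the 80% target is an illustrative choice.

```r
# Standard deviations of PC1 and PC2, copied from the output above
sdev_first_two <- c(1.9555, 1.8606)
n_vars <- 18   # number of scaled input variables

# Proportion of variance = sdev^2 / total variance
prop_var <- sdev_first_two^2 / n_vars
round(prop_var, 4)

# Cumulative proportions for PC1..PC10, copied from the output above
cumulative <- c(0.2124, 0.4048, 0.49163, 0.55352, 0.60746,
                0.65900, 0.70695, 0.75169, 0.79136, 0.82679)

# First component at which the cumulative proportion reaches 80%
n_components_80 <- which(cumulative >= 0.80)[1]
n_components_80
```

The first two components capture only about 40% of the variance, so a two-dimensional projection of this data loses a lot of information.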

Dropping variables that will not be used in this model

main_data = select(AirBnb, -c(host_id,host_name,host_since_year,host_since_anniversary,id,city,state,zipcode,review_scores_accuracy, review_scores_cleanliness,review_scores_checkin, review_scores_communication, review_scores_location, review_scores_value))
dim(main_data)
## [1] 7833   17

Creating new independent data variables for model

data_new_var <- main_data %>%
  mutate(bathroom_luxury = ifelse(bathrooms>0, accommodates/bathrooms,0),privacy = ifelse(bedrooms>0, beds/bedrooms,0))

K-means Clustering

Remove columns that will not be useful for clustering like price and country

clustering_data <- subset(data_new_var, select=-c(price,country))

Normalizing Function

normalize <- function(x){
  return ((x - min(x))/(max(x) - min(x)))
}
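
A quick usage example shows what this min-max scaling does. The function is repeated so that the snippet runs on its own, and the input vectors are made up.

```r
# Same min-max scaling as above, repeated so the example is self-contained
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# The smallest value maps to 0, the largest to 1
scaled <- normalize(c(2, 4, 6, 10))
scaled

# A constant column has max == min, so the result is 0/0 = NaN
constant_scaled <- normalize(c(5, 5, 5))
constant_scaled
```

Columns with missing values would also need na.rm = TRUE inside min() and max(); here the data contains no NA values, as the check before the K-means loop confirms.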

Normalizing Variables before analysis

names(clustering_data)
##  [1] "neighbourhood_cleansed" "property_type"          "room_type"             
##  [4] "accommodates"           "bathrooms"              "bedrooms"              
##  [7] "beds"                   "bed_type"               "guests_included"       
## [10] "extra_people"           "minimum_nights"         "host_response_time"    
## [13] "host_response_rate"     "number_of_reviews"      "review_scores_rating"  
## [16] "bathroom_luxury"        "privacy"
sapply(clustering_data, class)
## neighbourhood_cleansed          property_type              room_type 
##               "factor"               "factor"               "factor" 
##           accommodates              bathrooms               bedrooms 
##              "integer"              "numeric"              "numeric" 
##                   beds               bed_type        guests_included 
##              "numeric"               "factor"              "integer" 
##           extra_people         minimum_nights     host_response_time 
##              "integer"              "integer"              "integer" 
##     host_response_rate      number_of_reviews   review_scores_rating 
##              "integer"              "integer"              "numeric" 
##        bathroom_luxury                privacy 
##              "numeric"              "numeric"
clustering_data_norm = mutate(clustering_data, accom = normalize(accommodates), baths = normalize(bathrooms),
                              reviews_count = normalize(number_of_reviews), review_rating = normalize(review_scores_rating), bedroom_count=normalize(bedrooms),
                              bed_count=normalize(beds), bathrom_lux = normalize(bathroom_luxury), privacy=normalize(privacy))

clustering_data_norm1 = as.data.frame(clustering_data_norm)
clustering_data_norm2 = clustering_data_norm1 %>%
  cbind(acm.disjonctif(clustering_data_norm1[,c("bed_type","property_type","room_type","neighbourhood_cleansed","host_response_time")]))%>%ungroup()
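
acm.disjonctif turns each factor into one 0/1 indicator column per level. For readers without the ade4 package, a similar encoding can be sketched in base R with model.matrix(); this is an alternative shown for illustration, not the method used above, and the toy data is made up.

```r
# Toy factor column (illustrative only)
toy <- data.frame(room_type = factor(c("Entire home/apt", "Private room",
                                       "Entire home/apt", "Shared room")))

# "- 1" removes the intercept so that every level gets its own indicator column
dummies <- model.matrix(~ room_type - 1, data = toy)

colnames(dummies)
rowSums(dummies)   # each row has exactly one indicator set to 1
```

The column names differ slightly from those produced by acm.disjonctif, which joins the variable and level names with a dot.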

Remove the variables that are coded.

clustering_data_norm3 = clustering_data_norm2 %>% 
  select(-property_type,-room_type,-bed_type,-neighbourhood_cleansed,-host_response_time)

Remove columns that were created for factor levels that were not represented in the sample.

clustering_data_norm4 <- clustering_data_norm3[, colSums(clustering_data_norm3!=0, na.rm =TRUE)>0]

Now run K-means and look at the within-cluster SSE curve

SSE_curve <- c()
sum(is.na(clustering_data_norm4))
## [1] 0
for(n in 1:15){
  kcluster <- kmeans((clustering_data_norm4),n)
  sse <- sum(kcluster$withinss)
  SSE_curve[n] <- sse
}

SSE_curve
##  [1] 10023675  6694217  4829406  3812307  3382144  2617061  2346743  2249841
##  [9]  2030716  1941090  1772356  1745831  1505836  1533079  1425214

Elbow Method

print("SSE curve for ideal k value")
## [1] "SSE curve for ideal k value"
plot(1:15, SSE_curve, type="b", xlab="Number of clusters", ylab="SSE", main="Elbow Curve")
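
The SSE values printed above are not strictly decreasing (1505836 at k = 13 but 1533079 at k = 14). This can happen because kmeans() starts from random centres and may stop in a local optimum. A common remedy is to fix the seed and try several random starts per k. The sketch below does this on the built-in iris measurements, used here only as stand-in data.

```r
set.seed(42)

# Stand-in data: four numeric columns, standardised
x <- scale(iris[, 1:4])

# Total within-cluster SSE for k = 1..6, keeping the best of 25 random starts
sse <- sapply(1:6, function(k) {
  kmeans(x, centers = k, nstart = 25)$tot.withinss
})

# Percentage drop in SSE when going from k to k + 1
drop_pct <- round(100 * -diff(sse) / sse[-length(sse)], 1)

sse
drop_pct
```

The elbow is the point after which drop_pct becomes small. The SSE values in the millions above also suggest that unscaled columns such as host_response_rate and number_of_reviews, which remain in the data next to their normalized copies, dominate the distances.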

kcluster<- kmeans(clustering_data_norm4, 4)

print("The size of each clusters")
## [1] "The size of each clusters"
kcluster$size
## [1]  532 2397  882 4022
kcluster$centers
##   accommodates bathrooms bedrooms     beds guests_included extra_people
## 1     3.003759  1.076887 1.213186 1.855263        1.689850    16.246241
## 2     3.895286  1.183805 1.656858 2.505188        2.343763    35.354610
## 3     2.919501  1.085620 1.408163 1.871864        1.487528     8.164399
## 4     2.706862  1.081500 1.298832 1.714786        1.251119     1.510194
##   minimum_nights host_response_rate number_of_reviews review_scores_rating
## 1       2.062030           79.99248         89.979323             93.25000
## 2       2.415102           80.18982          9.956195             93.23614
## 3       2.798186           43.16327          6.558957             92.39633
## 4       2.560666           81.78319          7.666335             93.62522
##   bathroom_luxury    privacy     accom     baths reviews_count review_rating
## 1        2.834877 0.08995579 0.1335840 0.1346108    0.30296068     0.9156250
## 2        3.371624 0.09471770 0.1930191 0.1479757    0.03352254     0.9154518
## 3        2.729775 0.07640367 0.1279667 0.1357025    0.02208403     0.9049542
## 4        2.495843 0.07346016 0.1137908 0.1351875    0.02581258     0.9203152
##   bedroom_count  bed_count bathrom_lux bed_type.Airbed bed_type.Couch
## 1     0.1213186 0.05701754   0.1771798     0.003759398   0.0018796992
## 2     0.1656858 0.10034586   0.2107265     0.001251564   0.0008343763
## 3     0.1408163 0.05812425   0.1706109     0.000000000   0.0022675737
## 4     0.1298832 0.04765243   0.1559902     0.001989060   0.0014917951
##   bed_type.Futon bed_type.Pull-out Sofa bed_type.Real Bed
## 1    0.011278195            0.013157895         0.9699248
## 2    0.002920317            0.005840634         0.9891531
## 3    0.004535147            0.014739229         0.9784580
## 4    0.002237693            0.014917951         0.9793635
##   property_type.Apartment property_type.Bed & Breakfast property_type.Boat
## 1               0.7161654                    0.12406015        0.063909774
## 2               0.7801418                    0.03796412        0.063829787
## 3               0.8480726                    0.02494331        0.007936508
## 4               0.8157633                    0.04748881        0.033068125
##   property_type.Cabin property_type.Camper/RV property_type.Chalet
## 1        0.0056390977             0.000000000         0.0000000000
## 2        0.0004171882             0.001251564         0.0000000000
## 3        0.0000000000             0.001133787         0.0000000000
## 4        0.0019890602             0.001740428         0.0002486325
##   property_type.Dorm property_type.Earth House property_type.House
## 1        0.000000000              0.0000000000          0.07330827
## 2        0.000000000              0.0000000000          0.10179391
## 3        0.000000000              0.0000000000          0.10317460
## 4        0.000497265              0.0002486325          0.08378916
##   property_type.Hut property_type.Loft property_type.Other
## 1      0.0000000000        0.009398496         0.007518797
## 2      0.0000000000        0.009595327         0.002503129
## 3      0.0000000000        0.012471655         0.001133787
## 4      0.0002486325        0.009448036         0.004475385
##   property_type.Treehouse property_type.Villa property_type.Yurt
## 1            0.0000000000        0.0000000000       0.0000000000
## 2            0.0004171882        0.0012515645       0.0008343763
## 3            0.0000000000        0.0011337868       0.0000000000
## 4            0.0000000000        0.0009945301       0.0000000000
##   room_type.Entire home/apt room_type.Private room room_type.Shared room
## 1                 0.6654135              0.3345865           0.000000000
## 2                 0.8694201              0.1243221           0.006257822
## 3                 0.7743764              0.2176871           0.007936508
## 4                 0.7916459              0.2023869           0.005967181
##   neighbourhood_cleansed.Bijlmer-Centrum neighbourhood_cleansed.Bijlmer-Oost
## 1                            0.001879699                         0.001879699
## 2                            0.001251564                         0.001251564
## 3                            0.000000000                         0.002267574
## 4                            0.004972650                         0.002734958
##   neighbourhood_cleansed.Bos en Lommer
## 1                           0.03007519
## 2                           0.04005006
## 3                           0.04308390
## 4                           0.04699155
##   neighbourhood_cleansed.Buitenveldert - Zuidas
## 1                                   0.001879699
## 2                                   0.007926575
## 3                                   0.014739229
## 4                                   0.012680259
##   neighbourhood_cleansed.Centrum-Oost neighbourhood_cleansed.Centrum-West
## 1                          0.16729323                           0.2951128
## 2                          0.12223613                           0.1989987
## 3                          0.08843537                           0.1519274
## 4                          0.11437096                           0.1636002
##   neighbourhood_cleansed.De Aker - Nieuw Sloten
## 1                                   0.003759398
## 2                                   0.006257822
## 3                                   0.004535147
## 4                                   0.005221283
##   neighbourhood_cleansed.De Baarsjes - Oud-West
## 1                                     0.1616541
## 2                                     0.1422612
## 3                                     0.1564626
## 4                                     0.1586275
##   neighbourhood_cleansed.De Pijp - Rivierenbuurt
## 1                                     0.09022556
## 2                                     0.10972048
## 3                                     0.11337868
## 4                                     0.12307310
##   neighbourhood_cleansed.Gaasperdam - Driemond
## 1                                  0.000000000
## 2                                  0.001668753
## 3                                  0.000000000
## 4                                  0.001491795
##   neighbourhood_cleansed.Geuzenveld - Slotermeer
## 1                                    0.005639098
## 2                                    0.005006258
## 3                                    0.018140590
## 4                                    0.006713078
##   neighbourhood_cleansed.IJburg - Zeeburgereiland
## 1                                      0.01315789
## 2                                      0.01126408
## 3                                      0.02040816
## 4                                      0.01218299
##   neighbourhood_cleansed.Noord-Oost neighbourhood_cleansed.Noord-West
## 1                       0.005639098                       0.007518797
## 2                       0.004589070                       0.008343763
## 3                       0.005668934                       0.006802721
## 4                       0.006961711                       0.010442566
##   neighbourhood_cleansed.Oostelijk Havengebied - Indische Buurt
## 1                                                    0.02255639
## 2                                                    0.03504380
## 3                                                    0.06009070
## 4                                                    0.04699155
##   neighbourhood_cleansed.Osdorp neighbourhood_cleansed.Oud-Noord
## 1                   0.005639098                       0.01315789
## 2                   0.004589070                       0.02920317
## 3                   0.004535147                       0.02040816
## 4                   0.005718548                       0.02759821
##   neighbourhood_cleansed.Oud-Oost neighbourhood_cleansed.Slotervaart
## 1                      0.02067669                         0.01503759
## 2                      0.05047977                         0.02503129
## 3                      0.07482993                         0.02040816
## 4                      0.05271009                         0.01392342
##   neighbourhood_cleansed.Watergraafsmeer neighbourhood_cleansed.Westerpark
## 1                             0.01127820                        0.07142857
## 2                             0.02503129                        0.09261577
## 3                             0.02494331                        0.08730159
## 4                             0.02262556                        0.08751865
##   neighbourhood_cleansed.Zuid host_response_time.1 host_response_time.2
## 1                  0.05451128          0.003759398          0.007518797
## 2                  0.07717981          0.001251564          0.080517313
## 3                  0.08163265          0.201814059          0.000000000
## 4                  0.07284933          0.000000000          0.133018399
##   host_response_time.3 host_response_time.4 host_response_time.5
## 1            0.1860902           0.32706767           0.47556391
## 2            0.2027534           0.40467251           0.31080517
## 3            0.6984127           0.08390023           0.01587302
## 4            0.2076082           0.37991049           0.27946295

Add a new column holding the cluster assignment for each observation in the sample.

segment<-kcluster$cluster
clustering_data_norm5 <- cbind(clustering_data_norm4,segment)
head(clustering_data_norm5)
##   accommodates bathrooms bedrooms beds guests_included extra_people
## 1            4         2        2    2               4           10
## 2            2         1        1    2               1           10
## 3            4         1        1    1               2           25
## 4            2         1        1    1               1           10
## 5            6         1        2    2               2           25
## 6            4         1        1    1               2           25
##   minimum_nights host_response_rate number_of_reviews review_scores_rating
## 1              4                 65                11              98.0000
## 2              3                 85               108              97.0000
## 3              3                 85                15              92.0000
## 4              2                 85                20              97.0000
## 5              2                 74                 1             100.0000
## 6              2                 75                 0              93.3423
##   bathroom_luxury privacy      accom baths reviews_count review_rating
## 1               2  0.0625 0.20000000 0.250   0.037037037     0.9750000
## 2               2  0.1250 0.06666667 0.125   0.363636364     0.9625000
## 3               4  0.0625 0.20000000 0.125   0.050505051     0.9000000
## 4               2  0.0625 0.06666667 0.125   0.067340067     0.9625000
## 5               6  0.0625 0.33333333 0.125   0.003367003     1.0000000
## 6               4  0.0625 0.20000000 0.125   0.000000000     0.9167787
##   bedroom_count  bed_count bathrom_lux bed_type.Airbed bed_type.Couch
## 1           0.2 0.06666667       0.125               0              0
## 2           0.1 0.06666667       0.125               0              0
## 3           0.1 0.00000000       0.250               0              0
## 4           0.1 0.00000000       0.125               0              0
## 5           0.2 0.06666667       0.375               0              0
## 6           0.1 0.00000000       0.250               0              0
##   bed_type.Futon bed_type.Pull-out Sofa bed_type.Real Bed
## 1              0                      0                 1
## 2              0                      0                 1
## 3              0                      0                 1
## 4              0                      0                 1
## 5              0                      0                 1
## 6              0                      0                 1
##   property_type.Apartment property_type.Bed & Breakfast property_type.Boat
## 1                       1                             0                  0
## 2                       1                             0                  0
## 3                       1                             0                  0
## 4                       1                             0                  0
## 5                       1                             0                  0
## 6                       1                             0                  0
##   property_type.Cabin property_type.Camper/RV property_type.Chalet
## 1                   0                       0                    0
## 2                   0                       0                    0
## 3                   0                       0                    0
## 4                   0                       0                    0
## 5                   0                       0                    0
## 6                   0                       0                    0
##   property_type.Dorm property_type.Earth House property_type.House
## 1                  0                         0                   0
## 2                  0                         0                   0
## 3                  0                         0                   0
## 4                  0                         0                   0
## 5                  0                         0                   0
## 6                  0                         0                   0
##   property_type.Hut property_type.Loft property_type.Other
## 1                 0                  0                   0
## 2                 0                  0                   0
## 3                 0                  0                   0
## 4                 0                  0                   0
## 5                 0                  0                   0
## 6                 0                  0                   0
##   property_type.Treehouse property_type.Villa property_type.Yurt
## 1                       0                   0                  0
## 2                       0                   0                  0
## 3                       0                   0                  0
## 4                       0                   0                  0
## 5                       0                   0                  0
## 6                       0                   0                  0
##   room_type.Entire home/apt room_type.Private room room_type.Shared room
## 1                         1                      0                     0
## 2                         0                      1                     0
## 3                         1                      0                     0
## 4                         1                      0                     0
## 5                         1                      0                     0
## 6                         0                      1                     0
##   neighbourhood_cleansed.Bijlmer-Centrum neighbourhood_cleansed.Bijlmer-Oost
## 1                                      0                                   0
## 2                                      0                                   0
## 3                                      0                                   0
## 4                                      0                                   0
## 5                                      0                                   0
## 6                                      0                                   0
##   neighbourhood_cleansed.Bos en Lommer
## 1                                    0
## 2                                    0
## 3                                    0
## 4                                    0
## 5                                    0
## 6                                    0
##   neighbourhood_cleansed.Buitenveldert - Zuidas
## 1                                             0
## 2                                             0
## 3                                             0
## 4                                             0
## 5                                             0
## 6                                             0
##   neighbourhood_cleansed.Centrum-Oost neighbourhood_cleansed.Centrum-West
## 1                                   0                                   0
## 2                                   0                                   0
## 3                                   0                                   0
## 4                                   1                                   0
## 5                                   0                                   1
## 6                                   0                                   1
##   neighbourhood_cleansed.De Aker - Nieuw Sloten
## 1                                             0
## 2                                             0
## 3                                             0
## 4                                             0
## 5                                             0
## 6                                             0
##   neighbourhood_cleansed.De Baarsjes - Oud-West
## 1                                             0
## 2                                             0
## 3                                             1
## 4                                             0
## 5                                             0
## 6                                             0
##   neighbourhood_cleansed.De Pijp - Rivierenbuurt
## 1                                              0
## 2                                              0
## 3                                              0
## 4                                              0
## 5                                              0
## 6                                              0
##   neighbourhood_cleansed.Gaasperdam - Driemond
## 1                                            0
## 2                                            0
## 3                                            0
## 4                                            0
## 5                                            0
## 6                                            0
##   neighbourhood_cleansed.Geuzenveld - Slotermeer
## 1                                              0
## 2                                              0
## 3                                              0
## 4                                              0
## 5                                              0
## 6                                              0
##   neighbourhood_cleansed.IJburg - Zeeburgereiland
## 1                                               0
## 2                                               0
## 3                                               0
## 4                                               0
## 5                                               0
## 6                                               0
##   neighbourhood_cleansed.Noord-Oost neighbourhood_cleansed.Noord-West
## 1                                 0                                 0
## 2                                 0                                 0
## 3                                 0                                 0
## 4                                 0                                 0
## 5                                 0                                 0
## 6                                 0                                 0
##   neighbourhood_cleansed.Oostelijk Havengebied - Indische Buurt
## 1                                                             0
## 2                                                             1
## 3                                                             0
## 4                                                             0
## 5                                                             0
## 6                                                             0
##   neighbourhood_cleansed.Osdorp neighbourhood_cleansed.Oud-Noord
## 1                             0                                0
## 2                             0                                0
## 3                             0                                0
## 4                             0                                0
## 5                             0                                0
## 6                             0                                0
##   neighbourhood_cleansed.Oud-Oost neighbourhood_cleansed.Slotervaart
## 1                               0                                  0
## 2                               0                                  0
## 3                               0                                  0
## 4                               0                                  0
## 5                               0                                  0
## 6                               0                                  0
##   neighbourhood_cleansed.Watergraafsmeer neighbourhood_cleansed.Westerpark
## 1                                      0                                 1
## 2                                      0                                 0
## 3                                      0                                 0
## 4                                      0                                 0
## 5                                      0                                 0
## 6                                      0                                 0
##   neighbourhood_cleansed.Zuid host_response_time.1 host_response_time.2
## 1                           0                    0                    0
## 2                           0                    0                    0
## 3                           0                    0                    0
## 4                           0                    0                    0
## 5                           0                    0                    0
## 6                           0                    0                    0
##   host_response_time.3 host_response_time.4 host_response_time.5 segment
## 1                    1                    0                    0       4
## 2                    0                    0                    1       1
## 3                    0                    1                    0       2
## 4                    1                    0                    0       4
## 5                    1                    0                    0       2
## 6                    1                    0                    0       2
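# Make sure the feature table is a data frame
data_new_var <- as.data.frame(data_new_var)
# Wrap the cluster labels in a one-column data frame named "segment"
segment <- data.frame(segment = segment)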

Segment

airbnb_data_seg <- cbind(data_new_var,segment) 

Rename the column segment to cluster.

airbnb_data_seg<-rename(airbnb_data_seg, cluster = segment)
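
The attach-and-rename step can be tried in isolation. The following is a minimal sketch on made-up data, assuming only base R; the names `toy`, `toy_fit` and `toy_seg` are hypothetical and do not appear in the analysis above. It also stores the label as a factor, which is an assumption about how the label will be used in plots, not something the original code does.

```r
# Toy data standing in for the listing features (hypothetical values)
set.seed(42)
toy <- data.frame(
  bedrooms = c(1, 1, 2, 2, 4, 5),
  price    = c(80, 90, 120, 130, 300, 320)
)

# k-means on scaled features; kmeans() returns one cluster label per row
toy_fit <- kmeans(scale(toy), centers = 2, nstart = 10)

# Attach the labels as a named column, then rename that column with base R
toy_seg <- cbind(toy, segment = toy_fit$cluster)
names(toy_seg)[names(toy_seg) == "segment"] <- "cluster"

# Store the label as a factor so plots treat it as a category, not a number
toy_seg$cluster <- factor(toy_seg$cluster)
str(toy_seg)
```

Naming the column inside `cbind()` avoids creating an extra helper column, and the base R rename does not depend on which package's `rename()` is currently attached.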

Chart 1

# The label column was renamed to "cluster" above, so filter on that name
cluster1 <- subset(airbnb_data_seg, cluster == 1)

ggplot(data = airbnb_data_seg, aes(x = room_type, fill = factor(cluster))) +
  geom_bar(stat = "count", position = position_dodge()) +
  facet_grid(cluster ~ .) +
  labs(x = "Types of Rooms", y = "Number of Rooms", fill = "Cluster", title = "Distribution of various types of rooms across clusters")

Cluster 4 has the highest number of ‘Entire home/apt’ listings, followed by cluster 2. The majority of ‘Private room’ listings are also in cluster 4. Clusters 1, 2 and 4 have no shared rooms. Overall, ‘Entire home/apt’ is the most common room type, followed by ‘Private room’.
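
Counts read off a bar chart are easier to check in a table. The sketch below cross-tabulates room type by cluster on a small made-up data frame (`toy` and its values are hypothetical, not taken from the AirBnb data) and then converts each row to within-cluster shares.

```r
# Hypothetical listings with a cluster label and a room type
toy <- data.frame(
  cluster   = factor(c(1, 1, 2, 2, 2, 3, 4, 4, 4, 4)),
  room_type = c("Private room", "Entire home/apt", "Entire home/apt",
                "Entire home/apt", "Private room", "Shared room",
                "Entire home/apt", "Entire home/apt", "Entire home/apt",
                "Private room")
)

# Absolute counts: one row per cluster, one column per room type
counts <- table(cluster = toy$cluster, room_type = toy$room_type)
counts

# Row-wise shares, so clusters of different sizes can be compared
shares <- prop.table(counts, margin = 1)
round(shares, 2)
```

Applied to the real data, the same two calls on `airbnb_data_seg$cluster` and `airbnb_data_seg$room_type` would give the numbers behind the chart.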

Chart 2

ggplot(data = airbnb_data_seg, aes(x = bedrooms, y = log(price), fill = factor(cluster))) +
  geom_point(color = "plum", shape = 23) +
  geom_smooth(method = lm, se = FALSE) +
  facet_wrap(~ cluster) +
  labs(x = "Number of bedrooms", y = "log(Price)", fill = "Cluster",
       title = "Relationship between log(price) and number of bedrooms")
## `geom_smooth()` using formula 'y ~ x'

As the number of bedrooms increases, log(price) tends to increase. That is, there appears to be a positive linear relationship between the number of bedrooms and the log price of a listing.
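
The visual impression of a positive slope can be backed by fitting the same linear model that `geom_smooth(method = lm)` draws. This is a sketch on simulated data: `toy`, the true slope of 0.3 and the noise level are all assumptions chosen for illustration.

```r
# Simulated listings: two clusters, one to five bedrooms (hypothetical)
set.seed(1)
toy <- data.frame(
  cluster  = factor(rep(1:2, each = 20)),
  bedrooms = rep(1:5, times = 8)
)
# log(price) rises linearly with bedrooms, plus a little noise
toy$price <- exp(4 + 0.3 * toy$bedrooms + rnorm(nrow(toy), sd = 0.1))

# Fit log(price) ~ bedrooms separately in each cluster and keep the slope
slopes <- sapply(split(toy, toy$cluster), function(d) {
  coef(lm(log(price) ~ bedrooms, data = d))[["bedrooms"]]
})
slopes
```

A positive slope in every cluster supports the reading of the chart; a slope near zero or of mixed sign would suggest the relationship differs by cluster.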

Chart 3

ggplot(data = airbnb_data_seg, aes(x = log(price), fill = factor(cluster))) +
  geom_histogram(bins = 15) +
  facet_grid(cluster ~ .) +
  labs(x = "log(Price)", y = "Number of Rooms", fill = "Cluster",
       title = "Price of Rooms")

The log price of the rooms appears approximately normally distributed. The cheapest room is in cluster 1 and the most expensive room is in cluster 4. Overall, rooms in cluster 4 are the most expensive, followed by rooms in cluster 2. The log price in cluster 4 has the highest variance, while the log price in cluster 1 has the smallest variance.
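
Statements about the cheapest room, the most expensive room and the spread per cluster can be checked numerically. The following sketch uses simulated prices (the data frame `toy` and its parameters are hypothetical) and computes the minimum, maximum, mean and variance of log price per cluster, plus a Shapiro-Wilk test as one rough normality check.

```r
# Simulated prices for three clusters with increasing level and spread
set.seed(7)
toy <- data.frame(
  cluster = factor(rep(1:3, each = 30)),
  price   = exp(c(rnorm(30, 4.2, 0.2),
                  rnorm(30, 4.8, 0.4),
                  rnorm(30, 5.2, 0.6)))
)
toy$log_price <- log(toy$price)

# One row per cluster: min, max, mean and variance of log price
stats <- do.call(rbind, lapply(split(toy$log_price, toy$cluster), function(x) {
  c(min = min(x), max = max(x), mean = mean(x), var = var(x))
}))
round(stats, 3)

# Shapiro-Wilk p-value for the pooled log prices (small values argue
# against normality)
shapiro.test(toy$log_price)$p.value
```

A histogram alone cannot confirm normality, so a test or a normal Q-Q plot (`qqnorm()`) is a useful complement.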

Hierarchical Clustering

Use R’s scale() function to standardize every column, so that each has mean 0 and standard deviation 1.
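
As a quick illustration of what `scale()` does with its default arguments, the sketch below compares it with the manual formula (x - mean(x)) / sd(x) on a small made-up data frame; `toy`, `scaled` and `manual` are hypothetical names.

```r
# Two columns on very different scales (hypothetical values)
toy <- data.frame(
  bedrooms = c(1, 2, 2, 3, 5),
  price    = c(80, 120, 110, 200, 400)
)

# scale() centres each column on its mean and divides by its standard deviation
scaled <- as.data.frame(scale(toy))

# The same transformation written out by hand
manual <- as.data.frame(lapply(toy, function(x) (x - mean(x)) / sd(x)))

# Largest absolute difference between the two versions (numerically zero)
max(abs(as.matrix(scaled) - as.matrix(manual)))

# After scaling, every column has mean 0 and standard deviation 1
round(colMeans(scaled), 10)
apply(scaled, 2, sd)
```

Standardizing matters for distance-based clustering because otherwise columns with large numeric ranges dominate the distances. For one-hot indicator columns, rare categories end up with very large scaled values, which is visible in the maxima of the summary output.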

hirerachial_data_1 <- as.data.frame(scale(clustering_data_norm4))
summary(hirerachial_data_1)
##   accommodates       bathrooms          bedrooms            beds          
##  Min.   :-1.2032   Min.   :-2.8310   Min.   :-1.5980   Min.   :-0.595189  
##  1st Qu.:-0.6342   1st Qu.:-0.2873   1st Qu.:-0.4686   1st Qu.:-0.595189  
##  Median :-0.6342   Median :-0.2873   Median :-0.4686   Median :-0.595189  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.000000  
##  3rd Qu.: 0.5038   3rd Qu.:-0.2873   3rd Qu.: 0.6608   3rd Qu.: 0.009747  
##  Max.   : 7.3317   Max.   :17.5185   Max.   : 9.6960   Max.   : 8.478852  
##  guests_included    extra_people     minimum_nights    host_response_rate
##  Min.   :-1.4338   Min.   :-0.7201   Min.   :-0.7949   Min.   :-5.1977   
##  1st Qu.:-0.5605   1st Qu.:-0.7201   1st Qu.:-0.7949   1st Qu.:-0.1251   
##  Median :-0.5605   Median :-0.7201   Median :-0.2681   Median : 0.5604   
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   
##  3rd Qu.: 0.3127   3rd Qu.: 0.6019   3rd Qu.: 0.2587   3rd Qu.: 0.5604   
##  Max.   :12.5382   Max.   :11.7064   Max.   :12.9018   Max.   : 0.6289   
##  number_of_reviews  review_scores_rating bathroom_luxury      privacy       
##  Min.   :-0.54296   Min.   :-10.9982     Min.   :-2.1029   Min.   :-1.4907  
##  1st Qu.:-0.50371   1st Qu.: -0.2013     1st Qu.:-0.6079   1st Qu.:-0.3464  
##  Median :-0.34670   Median :  0.0000     Median :-0.6079   Median :-0.3464  
##  Mean   : 0.00000   Mean   :  0.0000     Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.04581   3rd Qu.:  0.6985     3rd Qu.: 0.8871   3rd Qu.: 0.3402  
##  Max.   :11.11471   Max.   :  0.9984     Max.   : 9.8571   Max.   :16.8183  
##      accom             baths         reviews_count      review_rating     
##  Min.   :-1.2032   Min.   :-2.8310   Min.   :-0.54296   Min.   :-10.9982  
##  1st Qu.:-0.6342   1st Qu.:-0.2873   1st Qu.:-0.50371   1st Qu.: -0.2013  
##  Median :-0.6342   Median :-0.2873   Median :-0.34670   Median :  0.0000  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.00000   Mean   :  0.0000  
##  3rd Qu.: 0.5038   3rd Qu.:-0.2873   3rd Qu.: 0.04581   3rd Qu.:  0.6985  
##  Max.   : 7.3317   Max.   :17.5185   Max.   :11.11471   Max.   :  0.9984  
##  bedroom_count       bed_count          bathrom_lux      bed_type.Airbed   
##  Min.   :-1.5980   Min.   :-0.595189   Min.   :-2.1029   Min.   :-0.04077  
##  1st Qu.:-0.4686   1st Qu.:-0.595189   1st Qu.:-0.6079   1st Qu.:-0.04077  
##  Median :-0.4686   Median :-0.595189   Median :-0.6079   Median :-0.04077  
##  Mean   : 0.0000   Mean   : 0.000000   Mean   : 0.0000   Mean   : 0.00000  
##  3rd Qu.: 0.6608   3rd Qu.: 0.009747   3rd Qu.: 0.8871   3rd Qu.:-0.04077  
##  Max.   : 9.6960   Max.   : 8.478852   Max.   : 9.8571   Max.   :24.52472  
##  bed_type.Couch    bed_type.Futon    bed_type.Pull-out Sofa bed_type.Real Bed
##  Min.   :-0.0375   Min.   :-0.0577   Min.   :-0.1102        Min.   :-7.3068  
##  1st Qu.:-0.0375   1st Qu.:-0.0577   1st Qu.:-0.1102        1st Qu.: 0.1368  
##  Median :-0.0375   Median :-0.0577   Median :-0.1102        Median : 0.1368  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000        Mean   : 0.0000  
##  3rd Qu.:-0.0375   3rd Qu.:-0.0577   3rd Qu.:-0.1102        3rd Qu.: 0.1368  
##  Max.   :26.6646   Max.   :17.3272   Max.   : 9.0730        Max.   : 0.1368  
##  property_type.Apartment property_type.Bed & Breakfast property_type.Boat
##  Min.   :-2.0108         Min.   :-0.2226               Min.   :-0.2087   
##  1st Qu.: 0.4973         1st Qu.:-0.2226               1st Qu.:-0.2087   
##  Median : 0.4973         Median :-0.2226               Median :-0.2087   
##  Mean   : 0.0000         Mean   : 0.0000               Mean   : 0.0000   
##  3rd Qu.: 0.4973         3rd Qu.:-0.2226               3rd Qu.:-0.2087   
##  Max.   : 0.4973         Max.   : 4.4908               Max.   : 4.7907   
##  property_type.Cabin property_type.Camper/RV property_type.Chalet
##  Min.   :-0.03917    Min.   :-0.0375         Min.   :-0.0113     
##  1st Qu.:-0.03917    1st Qu.:-0.0375         1st Qu.:-0.0113     
##  Median :-0.03917    Median :-0.0375         Median :-0.0113     
##  Mean   : 0.00000    Mean   : 0.0000         Mean   : 0.0000     
##  3rd Qu.:-0.03917    3rd Qu.:-0.0375         3rd Qu.:-0.0113     
##  Max.   :25.52776    Max.   :26.6646         Max.   :88.4929     
##  property_type.Dorm property_type.Earth House property_type.House
##  Min.   :-0.01598   Min.   :-0.0113           Min.   :-0.3159    
##  1st Qu.:-0.01598   1st Qu.:-0.0113           1st Qu.:-0.3159    
##  Median :-0.01598   Median :-0.0113           Median :-0.3159    
##  Mean   : 0.00000   Mean   : 0.0000           Mean   : 0.0000    
##  3rd Qu.:-0.01598   3rd Qu.:-0.0113           3rd Qu.:-0.3159    
##  Max.   :62.56996   Max.   :88.4929           Max.   : 3.1647    
##  property_type.Hut property_type.Loft property_type.Other
##  Min.   :-0.0113   Min.   :-0.09963   Min.   :-0.06096   
##  1st Qu.:-0.0113   1st Qu.:-0.09963   1st Qu.:-0.06096   
##  Median :-0.0113   Median :-0.09963   Median :-0.06096   
##  Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.00000   
##  3rd Qu.:-0.0113   3rd Qu.:-0.09963   3rd Qu.:-0.06096   
##  Max.   :88.4929   Max.   :10.03566   Max.   :16.40333   
##  property_type.Treehouse property_type.Villa property_type.Yurt
##  Min.   :-0.0113         Min.   :-0.03197    Min.   :-0.01598  
##  1st Qu.:-0.0113         1st Qu.:-0.03197    1st Qu.:-0.01598  
##  Median :-0.0113         Median :-0.03197    Median :-0.01598  
##  Mean   : 0.0000         Mean   : 0.00000    Mean   : 0.00000  
##  3rd Qu.:-0.0113         3rd Qu.:-0.03197    3rd Qu.:-0.01598  
##  Max.   :88.4929         Max.   :31.27299    Max.   :62.56996  
##  room_type.Entire home/apt room_type.Private room room_type.Shared room
##  Min.   :-2.0312           Min.   :-0.483         Min.   :-0.07685     
##  1st Qu.: 0.4923           1st Qu.:-0.483         1st Qu.:-0.07685     
##  Median : 0.4923           Median :-0.483         Median :-0.07685     
##  Mean   : 0.0000           Mean   : 0.000         Mean   : 0.00000     
##  3rd Qu.: 0.4923           3rd Qu.:-0.483         3rd Qu.:-0.07685     
##  Max.   : 0.4923           Max.   : 2.070         Max.   :13.01003     
##  neighbourhood_cleansed.Bijlmer-Centrum neighbourhood_cleansed.Bijlmer-Oost
##  Min.   :-0.05543                       Min.   :-0.04663                   
##  1st Qu.:-0.05543                       1st Qu.:-0.04663                   
##  Median :-0.05543                       Median :-0.04663                   
##  Mean   : 0.00000                       Mean   : 0.00000                   
##  3rd Qu.:-0.05543                       3rd Qu.:-0.04663                   
##  Max.   :18.03700                       Max.   :21.44076                   
##  neighbourhood_cleansed.Bos en Lommer
##  Min.   :-0.2127                     
##  1st Qu.:-0.2127                     
##  Median :-0.2127                     
##  Mean   : 0.0000                     
##  3rd Qu.:-0.2127                     
##  Max.   : 4.7014                     
##  neighbourhood_cleansed.Buitenveldert - Zuidas
##  Min.   :-0.1041                              
##  1st Qu.:-0.1041                              
##  Median :-0.1041                              
##  Mean   : 0.0000                              
##  3rd Qu.:-0.1041                              
##  Max.   : 9.6041                              
##  neighbourhood_cleansed.Centrum-Oost neighbourhood_cleansed.Centrum-West
##  Min.   :-0.3648                     Min.   :-0.4717                    
##  1st Qu.:-0.3648                     1st Qu.:-0.4717                    
##  Median :-0.3648                     Median :-0.4717                    
##  Mean   : 0.0000                     Mean   : 0.0000                    
##  3rd Qu.:-0.3648                     3rd Qu.:-0.4717                    
##  Max.   : 2.7410                     Max.   : 2.1195                    
##  neighbourhood_cleansed.De Aker - Nieuw Sloten
##  Min.   :-0.07342                             
##  1st Qu.:-0.07342                             
##  Median :-0.07342                             
##  Mean   : 0.00000                             
##  3rd Qu.:-0.07342                             
##  Max.   :13.61897                             
##  neighbourhood_cleansed.De Baarsjes - Oud-West
##  Min.   :-0.4259                              
##  1st Qu.:-0.4259                              
##  Median :-0.4259                              
##  Mean   : 0.0000                              
##  3rd Qu.:-0.4259                              
##  Max.   : 2.3474                              
##  neighbourhood_cleansed.De Pijp - Rivierenbuurt
##  Min.   :-0.3616                               
##  1st Qu.:-0.3616                               
##  Median :-0.3616                               
##  Mean   : 0.0000                               
##  3rd Qu.:-0.3616                               
##  Max.   : 2.7649                               
##  neighbourhood_cleansed.Gaasperdam - Driemond
##  Min.   :-0.03575                            
##  1st Qu.:-0.03575                            
##  Median :-0.03575                            
##  Mean   : 0.00000                            
##  3rd Qu.:-0.03575                            
##  Max.   :27.96784                            
##  neighbourhood_cleansed.Geuzenveld - Slotermeer
##  Min.   :-0.08636                              
##  1st Qu.:-0.08636                              
##  Median :-0.08636                              
##  Mean   : 0.00000                              
##  3rd Qu.:-0.08636                              
##  Max.   :11.57733                              
##  neighbourhood_cleansed.IJburg - Zeeburgereiland
##  Min.   :-0.1143                                
##  1st Qu.:-0.1143                                
##  Median :-0.1143                                
##  Mean   : 0.0000                                
##  3rd Qu.:-0.1143                                
##  Max.   : 8.7490                                
##  neighbourhood_cleansed.Noord-Oost neighbourhood_cleansed.Noord-West
##  Min.   :-0.07769                  Min.   :-0.09631                 
##  1st Qu.:-0.07769                  1st Qu.:-0.09631                 
##  Median :-0.07769                  Median :-0.09631                 
##  Mean   : 0.00000                  Mean   : 0.00000                 
##  3rd Qu.:-0.07769                  3rd Qu.:-0.09631                 
##  Max.   :12.87006                  Max.   :10.38161                 
##  neighbourhood_cleansed.Oostelijk Havengebied - Indische Buurt
##  Min.   :-0.2123                                              
##  1st Qu.:-0.2123                                              
##  Median :-0.2123                                              
##  Mean   : 0.0000                                              
##  3rd Qu.:-0.2123                                              
##  Max.   : 4.7087                                              
##  neighbourhood_cleansed.Osdorp neighbourhood_cleansed.Oud-Noord
##  Min.   :-0.07253              Min.   :-0.1643                 
##  1st Qu.:-0.07253              1st Qu.:-0.1643                 
##  Median :-0.07253              Median :-0.1643                 
##  Mean   : 0.00000              Mean   : 0.0000                 
##  3rd Qu.:-0.07253              3rd Qu.:-0.1643                 
##  Max.   :13.78494              Max.   : 6.0844                 
##  neighbourhood_cleansed.Oud-Oost neighbourhood_cleansed.Slotervaart
##  Min.   :-0.235                  Min.   :-0.1359                   
##  1st Qu.:-0.235                  1st Qu.:-0.1359                   
##  Median :-0.235                  Median :-0.1359                   
##  Mean   : 0.000                  Mean   : 0.0000                   
##  3rd Qu.:-0.235                  3rd Qu.:-0.1359                   
##  Max.   : 4.255                  Max.   : 7.3590                   
##  neighbourhood_cleansed.Watergraafsmeer neighbourhood_cleansed.Westerpark
##  Min.   :-0.1529                        Min.   :-0.3105                  
##  1st Qu.:-0.1529                        1st Qu.:-0.3105                  
##  Median :-0.1529                        Median :-0.3105                  
##  Mean   : 0.0000                        Mean   : 0.0000                  
##  3rd Qu.:-0.1529                        3rd Qu.:-0.3105                  
##  Max.   : 6.5387                        Max.   : 3.2198                  
##  neighbourhood_cleansed.Zuid host_response_time.1 host_response_time.2
##  Min.   :-0.2825             Min.   :-0.1547      Min.   :-0.321      
##  1st Qu.:-0.2825             1st Qu.:-0.1547      1st Qu.:-0.321      
##  Median :-0.2825             Median :-0.1547      Median :-0.321      
##  Mean   : 0.0000             Mean   : 0.0000      Mean   : 0.000      
##  3rd Qu.:-0.2825             3rd Qu.:-0.1547      3rd Qu.:-0.321      
##  Max.   : 3.5393             Max.   : 6.4651      Max.   : 3.114      
##  host_response_time.3 host_response_time.4 host_response_time.5
##  Min.   :-0.5926      Min.   :-0.7347      Min.   :-0.6123     
##  1st Qu.:-0.5926      1st Qu.:-0.7347      1st Qu.:-0.6123     
##  Median :-0.5926      Median :-0.7347      Median :-0.6123     
##  Mean   : 0.0000      Mean   : 0.0000      Mean   : 0.0000     
##  3rd Qu.: 1.6873      3rd Qu.: 1.3610      3rd Qu.: 1.6330     
##  Max.   : 1.6873      Max.   : 1.3610      Max.   : 1.6330

Notice that the mean of every attribute is zero and the standard deviation is equal to one.
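
The zero means in the summary are what standardization produces. The following is a minimal sketch on made-up data, assuming the columns were standardized with base R's scale(); the data frame toy_df and its values are hypothetical and are not taken from the AirBnb dataset.

```r
# Hypothetical toy data, not the AirBnb dataset
toy_df <- data.frame(beds     = c(1, 2, 2, 4, 6),
                     bedrooms = c(1, 1, 2, 3, 4))

# scale() centers each column to mean 0 and rescales it to standard deviation 1
toy_scaled <- as.data.frame(scale(toy_df))

# Means are zero up to floating-point error; standard deviations are one
round(colMeans(toy_scaled), 10)
sapply(toy_scaled, sd)
```

Putting every attribute on the same scale matters here because distance-based clustering would otherwise be dominated by the attributes with the largest numeric range.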

All the values are now numeric, so we will use the Euclidean distance method.

hirerachial_data_2 <- dist(hirerachial_data_1, method = 'euclidean')
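
To make the distance step concrete, here is a small sketch of what dist() computes with method = 'euclidean'. The three points in toy_points are hypothetical and chosen so the result is easy to check by hand.

```r
# Hypothetical points, not taken from the AirBnb data
toy_points <- rbind(a = c(0, 0), b = c(3, 4), c = c(6, 8))

# dist() returns the pairwise distances between the rows
toy_dist <- dist(toy_points, method = "euclidean")

# The same distance between a and b computed by hand: sqrt(3^2 + 4^2)
manual_ab <- sqrt(sum((toy_points["a", ] - toy_points["b", ])^2))

as.matrix(toy_dist)
manual_ab
```

Because dist() stores one value per pair of rows, its memory use grows quadratically with the number of observations.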

Applying Linkage Method

hirerachial_data_3 <- hclust(hirerachial_data_2, method = "ward.D2")

Plot the hierarchical clustering

plot(hirerachial_data_3, hang=-1, cex=0.7)

Set the K value to 4 (clusters) and plot

To see the clusters on the dendrogram, you can use R’s abline() function to draw the cut line and the rect.hclust() function to superimpose a rectangle around each cluster, as shown in the following code:

k_hirerachina_data_3 <- cutree(hirerachial_data_3, k = 4)

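plot(hirerachial_data_3)
rect.hclust(hirerachial_data_3 , k = 4, border = 2:6)
# Draw the cut line midway between the merges that leave 4 and 3 clusters
cut_height <- mean(rev(hirerachial_data_3$height)[3:4])
abline(h = cut_height, col = 'red')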
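
A cut line only matches the rectangles when its height falls between the merge that leaves k clusters and the merge that leaves k - 1. The sketch below checks this relationship on hypothetical toy data (toy, toy_hc and the choice of k = 3 are illustrative assumptions, not part of the AirBnb analysis).

```r
# Hypothetical toy data: three well-separated groups in two dimensions
set.seed(1)
toy <- rbind(matrix(rnorm(20, mean = 0),  ncol = 2),
             matrix(rnorm(20, mean = 10), ncol = 2),
             matrix(rnorm(20, mean = 20), ncol = 2))
toy_hc <- hclust(dist(toy), method = "ward.D2")

k <- 3
# Merge heights are stored in ascending order; a cut placed between the
# (k-1)-th and k-th largest heights leaves exactly k clusters
top_heights <- rev(toy_hc$height)
cut_height <- mean(top_heights[(k - 1):k])

by_k <- cutree(toy_hc, k = k)           # cut by number of clusters
by_h <- cutree(toy_hc, h = cut_height)  # cut by height
all(by_k == by_h)
```

Deriving the height from the tree avoids hard-coding a value that depends on the linkage method and on the scale of the data.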

Now we can see the four clusters enclosed in differently colored boxes. We can also use the color_branches() function from the dendextend library to visualize our tree with different colored branches.

suppressPackageStartupMessages(library(dendextend))
avg_dend_obj <- as.dendrogram(hirerachial_data_3)
avg_col_dend <- color_branches(avg_dend_obj, k = 4)
plot(avg_col_dend)

Now we will append the cluster assignments to the data frame as a new column named cluster, using mutate() from the dplyr package, and count how many observations were assigned to each cluster with the count() function.

suppressPackageStartupMessages(library(dplyr))
hirerachial_c1 <- mutate(hirerachial_data_1, cluster = k_hirerachina_data_3)
count(hirerachial_c1,cluster)
##   cluster    n
## 1       1 5878
## 2       2  141
## 3       3  570
## 4       4 1244
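
The counts are easier to compare as shares of the total. A minimal sketch, with the cluster sizes copied from the output above (the data frame name cluster_sizes is introduced here only for illustration):

```r
# Cluster sizes copied from the count() output above
cluster_sizes <- data.frame(cluster = 1:4, n = c(5878, 141, 570, 1244))

# Share of all observations that falls in each cluster, in percent
cluster_sizes$share_pct <- round(100 * cluster_sizes$n / sum(cluster_sizes$n), 1)
cluster_sizes
```

The clusters are far from balanced: roughly three quarters of the observations fall in cluster 1, while cluster 2 holds under 2 percent.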

It’s common to examine the relationship between two features, colored by cluster, in order to extract more useful insights about each cluster.

suppressPackageStartupMessages(library(ggplot2))
ggplot(hirerachial_c1, aes(x=beds, y = bedrooms, color = factor(cluster))) + geom_point()
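
A numeric summary per cluster can complement the scatter plot. The sketch below uses base R's aggregate() on a hypothetical stand-in for hirerachial_c1; the values in toy_clusters are made up for illustration.

```r
# Hypothetical stand-in for hirerachial_c1 (values are made up)
toy_clusters <- data.frame(
  beds     = c(1, 2, 1, 4, 5, 6),
  bedrooms = c(1, 1, 1, 3, 3, 4),
  cluster  = c(1, 1, 1, 2, 2, 2)
)

# Mean of each feature within each cluster
cluster_profile <- aggregate(cbind(beds, bedrooms) ~ cluster,
                             data = toy_clusters, FUN = mean)
cluster_profile
```

Applied to the clustered data frame, the same call would report means of the standardized columns, so the values would read as deviations from the overall average rather than as raw counts of beds or bedrooms.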